Thread: LDC - Load Distributed Checkpoints with PG8.3b2 on Solaris

LDC - Load Distributed Checkpoints with PG8.3b2 on Solaris

From
"Jignesh K. Shah"
Date:
Hello,

I am running tests with PG8.3b2 on Solaris 10 8/07 and I still see an IO 
flood when a checkpoint happens.
I have tried increasing bgwriter_lru_multiplier from its default of 2 to 5, 
but I don't see any more writes by the bgwriter than in my previous 
test, which used the default.

Then I tried increasing checkpoint_completion_target to 0.9, but I still see 
no spread of the IO (checkpoint_timeout is set to the default of 5min).

What am I missing?

How does PostgreSQL determine the Load distribution?

Regards,
Jignesh




Re: LDC - Load Distributed Checkpoints with PG8.3b2 on Solaris

From
Heikki Linnakangas
Date:
Jignesh K. Shah wrote:
> I am running tests with PG8.3b2 on Solaris 10 8/07 and I still see an IO 
> flood when a checkpoint happens.
> I have tried increasing bgwriter_lru_multiplier from its default of 2 to 5, 
> but I don't see any more writes by the bgwriter than in my previous 
> test, which used the default.
> 
> Then I tried increasing checkpoint_completion_target to 0.9, but I still see 
> no spread of the IO (checkpoint_timeout is set to the default of 5min).
> 
> What am I missing?

Two things spring to mind:

Even though the write()s are distributed, the fsyncs are still going to 
cause a spike. But it should be much less severe than before. How bad 
are the spikes you're seeing? Compared to 8.2?

Are your checkpoints triggered by checkpoint_timeout or 
checkpoint_segments? The calculation of "how much time do I have to 
finish this checkpoint before the next one is triggered" takes both into 
account, but the estimate based on checkpoint_segments is not very 
accurate because of full page writes: right after a checkpoint we write 
a lot more WAL, which makes the load distribution algorithm think it's 
going to run out of checkpoint_segments much sooner than it actually 
does. Raising checkpoint_segments will help with that.
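
For illustration, the scheduling test works roughly like this (a simplified 
sketch of the 8.3-era IsCheckpointOnSchedule idea, not the actual source; the 
GUC values and the variables tracking elapsed time and WAL consumption are 
made-up names):

    #include <stdbool.h>

    /* GUC settings (example values). */
    static double checkpoint_completion_target = 0.9;
    static double checkpoint_segments = 64;
    static double checkpoint_timeout_secs = 300;      /* 5min */

    /* Assumed to be updated elsewhere while the checkpoint runs. */
    static double segments_used_since_start;          /* WAL segments consumed */
    static double seconds_since_start;                /* wall-clock time elapsed */

    /*
     * Is the checkpoint still on schedule?  progress = fraction of the
     * dirty buffers written so far (0.0 .. 1.0).
     */
    static bool
    checkpoint_on_schedule(double progress)
    {
        /* Aim to finish when completion_target of the interval is used up. */
        progress *= checkpoint_completion_target;

        /*
         * WAL-based estimate: full-page writes inflate WAL volume right
         * after the checkpoint starts, so this fraction rises quickly early
         * on and the checkpointer hurries more than it really needs to.
         * A larger checkpoint_segments dilutes that effect.
         */
        if (progress < segments_used_since_start / checkpoint_segments)
            return false;                  /* behind schedule: write full speed */

        /* Time-based estimate. */
        if (progress < seconds_since_start / checkpoint_timeout_secs)
            return false;                  /* behind schedule: write full speed */

        return true;                       /* ahead of schedule: nap before the next write */
    }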

BTW, please turn on log_checkpoints.

> How does PostgreSQL determine the Load distribution?

First, when the checkpoint starts, the shared buffer pool is scanned and 
the dirty buffers are counted. Then the bgwriter starts to write them out, 
constantly estimating how much work it has left and how much time it 
has until the next checkpoint. The write rate is throttled so that the 
remaining writes are distributed evenly across the time remaining 
(checkpoint_completion_target is a fuzz factor applied to the estimate 
of how much time is remaining).
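
As a rough worked example of the pacing arithmetic (illustrative numbers 
only, not actual source code):

    #include <stdio.h>

    /*
     * Spread N dirty-buffer writes evenly over the part of the checkpoint
     * interval we allow ourselves to use.
     */
    int
    main(void)
    {
        int    dirty_buffers = 33362;     /* counted when the checkpoint started */
        double timeout_secs  = 300.0;     /* checkpoint_timeout = 5min */
        double target        = 0.9;       /* checkpoint_completion_target */

        double budget_secs    = timeout_secs * target;        /* ~270 s to finish */
        double writes_per_sec = dirty_buffers / budget_secs;  /* ~124 buffers/s */

        printf("pace: %.0f buffers/s, aiming to finish in %.0f of %.0f seconds\n",
               writes_per_sec, budget_secs, timeout_secs);
        return 0;
    }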

Hope this helps..

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: LDC - Load Distributed Checkpoints with PG8.3b2 on Solaris

From
Tom Lane
Date:
"Jignesh K. Shah" <J.K.Shah@Sun.COM> writes:
> I am running tests with PG8.3b2 on Solaris 10 8/07 and I still see an IO 
> flood when a checkpoint happens.

I am thinking that you are probably trying to test that by issuing
manual CHECKPOINT commands.  A manual checkpoint is still done at full
speed, as are shutdown checkpoints.  You need to study the behavior of
automatic (background) checkpoints, instead.
        regards, tom lane


Re: LDC - Load Distributed Checkpoints with PG8.3b2 on Solaris

From
"Jignesh K. Shah"
Date:
I am running the EAStress workload, which doesn't do manual checkpoints as 
far as I know.

I will turn on log_checkpoints to get a better idea, as Heikki suggested.

Thanks.

-Jignesh


Tom Lane wrote:
> "Jignesh K. Shah" <J.K.Shah@Sun.COM> writes:
>   
>> I am running tests with PG8.3b2 on Solaris 10 8/07 and I still see an IO 
>> flood when a checkpoint happens.
>>     
>
> I am thinking that you are probably trying to test that by issuing
> manual CHECKPOINT commands.  A manual checkpoint is still done at full
> speed, as are shutdown checkpoints.  You need to study the behavior of
> automatic (background) checkpoints, instead.
>
>             regards, tom lane
>   


Re: LDC - Load Distributed Checkpoints with PG8.3b2 on Solaris

From
Greg Smith
Date:
On Tue, 13 Nov 2007, Jignesh K. Shah wrote:

> I have tried increasing bgwriter_lru_multiplier from its default of 2 to 5, but I 
> don't see any more writes by the bgwriter than in my previous test, which 
> used the default.

The multiplier only impacts writes being done by the LRU eviction 
mechanism; it doesn't change anything related to the checkpoint writes. 
In fact, since those LRU writes are going on at the same time as the 
checkpoint ones, increasing the multiplier too much can make the 
checkpoint I/O spike worse.  It's unlikely that higher values will 
decrease the spike.
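
Roughly speaking, the multiplier feeds into a calculation like this (a 
simplified sketch with made-up variable names, not the actual 8.3 bgwriter 
code): it only sizes the LRU cleaning done ahead of the clock sweep, based 
on recent buffer-allocation demand, and plays no part in the checkpoint 
schedule.

    /* GUCs (example values). */
    static double bgwriter_lru_multiplier = 2.0;
    static int    bgwriter_lru_maxpages   = 100;      /* per-round cap */

    static int
    lru_buffers_to_clean(double smoothed_recent_allocs, int clean_buffers_ahead)
    {
        /* Predicted allocations before the next bgwriter round. */
        double upcoming = smoothed_recent_allocs * bgwriter_lru_multiplier;

        /* Only cover the shortfall of clean buffers ahead of the sweep point. */
        int shortfall = (int) upcoming - clean_buffers_ahead;
        if (shortfall <= 0)
            return 0;
        if (shortfall > bgwriter_lru_maxpages)
            shortfall = bgwriter_lru_maxpages;
        return shortfall;
    }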

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: LDC - Load Distributed Checkpoints with PG8.3b2 on Solaris

From
ITAGAKI Takahiro
Date:
"Jignesh K. Shah" <J.K.Shah@Sun.COM> wrote:

> I am running tests with PG8.3b2 on Solaris 10 8/07 and I still see an IO 
> flood when a checkpoint happens.

Are there any I/O tuning knobs in Solaris? LDC in 8.3 expects the kernel's
background writing activity to be strong enough to keep the number of dirty
pages in the kernel small, i.e. fsync should be cheap.

Larger shared_buffers, checkpoint_timeout, checkpoint_segments and
checkpoint_completion_target make checkpoints smoother, but there is a
trade-off between smoothness and recovery time.

More aggressive writer processes in the kernel also make checkpoints
smoother, but there is a trade-off between smoothness and throughput.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: LDC - Load Distributed Checkpoints with PG8.3b2 on Solaris

From
Tom Lane
Date:
"Jignesh K. Shah" <J.K.Shah@Sun.COM> writes:
> I will turn on log_checkpoints to get a better idea, as Heikki suggested

Did you find out anything?

Did this happen on every checkpoint, or only some of them?  The bug
Itagaki-san pointed out today in IsCheckpointOnSchedule might account
for some checkpoints being done at full speed, but not all ...
        regards, tom lane


Re: LDC - Load Distributed Checkpoints with PG8.3b2 on Solaris

From
"Jignesh K. Shah"
Date:
I was waiting to digest what I saw before sending it to the group.

I am running the EAStress workload.

I am using open_datasync (O_DSYNC), which should sync as soon as it is written.

With checkpoint_completion_target=0.9 and checkpoint_timeout=5min it seems to 
be doing the right thing, judging from the logfile output (the write phase 
below runs for about 270 s, i.e. 0.9 of the 300 s checkpoint interval):

2007-11-13 09:20:49.070 PST 9180  LOG:  checkpoint starting: time
2007-11-13 09:21:13.808 PST 9458  LOG:  automatic analyze of table "specdb.public.o_orderline" system usage: CPU 0.03s/0.50u sec elapsed 7.79 sec
2007-11-13 09:21:19.830 PST 9458  LOG:  automatic vacuum of table "specdb.public.txn_log_table": index scans: 1
        pages: 11 removed, 105 remain
        tuples: 3147 removed, 40 remain
        system usage: CPU 0.11s/0.09u sec elapsed 6.02 sec
2007-11-13 09:22:12.112 PST 9462  LOG:  automatic vacuum of table "specdb.public.txn_log_table": index scans: 1
        pages: 28 removed, 77 remain
        tuples: 1990 removed, 95 remain
        system usage: CPU 0.11s/0.09u sec elapsed 5.98 sec
2007-11-13 09:23:12.121 PST 9466  LOG:  automatic vacuum of table "specdb.public.txn_log_table": index scans: 1
        pages: 0 removed, 77 remain
        tuples: 3178 removed, 128 remain
        system usage: CPU 0.11s/0.04u sec elapsed 5.87 sec
2007-11-13 09:24:12.220 PST 9470  LOG:  automatic vacuum of table "specdb.public.txn_log_table": index scans: 1
        pages: 0 removed, 77 remain
        tuples: 3394 removed, 57 remain
        system usage: CPU 0.11s/0.04u sec elapsed 5.85 sec
2007-11-13 09:25:12.400 PST 9474  LOG:  automatic vacuum of table "specdb.public.txn_log_table": index scans: 1
        pages: 0 removed, 77 remain
        tuples: 3137 removed, 1 remain
        system usage: CPU 0.11s/0.04u sec elapsed 5.93 sec
2007-11-13 09:25:18.723 PST 9180  LOG:  checkpoint complete: wrote 33362 buffers (2.2%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=269.642 s, sync=0.003 s, total=269.653 s
2007-11-13 09:25:49.000 PST 9180  LOG:  checkpoint starting: time


However, the actual iostat output still shows a non-uniform distribution, but 
I haven't put exact timestamps on the iostat outputs to correlate them with 
the logfile entries. Maybe I should do that.

So from the PostgreSQL point of view things look fine based on these outputs; 
I need to figure out the Solaris view of it now.

Could it be related to autovacuum happening also?


Regards,
Jignesh



Tom Lane wrote:
> "Jignesh K. Shah" <J.K.Shah@Sun.COM> writes:
>   
>> I will turn on log_checkpoints to get a better idea, as Heikki suggested
>>     
>
> Did you find out anything?
>
> Did this happen on every checkpoint, or only some of them?  The bug
> Itagaki-san pointed out today in IsCheckpointOnSchedule might account
> for some checkpoints being done at full speed, but not all ...
>
>             regards, tom lane
>   


Re: LDC - Load Distributed Checkpoints with PG8.3b2 on Solaris

From
Tom Lane
Date:
"Jignesh K. Shah" <J.K.Shah@Sun.COM> writes:
> So from the PostgreSQL point of view things look fine based on these outputs; 
> I need to figure out the Solaris view of it now.

> Could it be related to autovacuum happening also?

Maybe ... have you tried fiddling with the vacuum_cost_delay options?

Looking at the autovacuum log output, 

> 2007-11-13 09:21:19.830 PST 9458  LOG:  automatic vacuum of table 
> "specdb.public.txn_log_table": index scans: 1
>         pages: 11 removed, 105 remain
>         tuples: 3147 removed, 40 remain
>         system usage: CPU 0.11s/0.09u sec elapsed 6.02 sec

it seems like a serious omission that this gives you no hint how many
pages were scanned.
        regards, tom lane


Re: LDC - Load Distributed Checkpoints with PG8.3b2 on Solaris

From
"Jignesh K. Shah"
Date:
I don't understand vacuum a lot.. I admit I am stupid :-)

When you say scanned, do you mean reads or writes?

Since it's really writes that I am having trouble with: the autovacuum 
message tells me 11 pages were removed and so many tuples were 
removed, so I am guessing it's writes.

I can try vacuuming that table before the run starts to see if it can 
avoid that.

-Jignesh




Tom Lane wrote:
> "Jignesh K. Shah" <J.K.Shah@Sun.COM> writes:
>   
>> So from the PostgreSQL point of view things look fine based on these outputs; 
>> I need to figure out the Solaris view of it now.
>>     
>
>   
>> Could it be related to autovacuum happening also?
>>     
>
> Maybe ... have you tried fiddling with the vacuum_cost_delay options?
>
> Looking at the autovacuum log output, 
>
>   
>> 2007-11-13 09:21:19.830 PST 9458  LOG:  automatic vacuum of table 
>> "specdb.public.txn_log_table": index scans: 1
>>         pages: 11 removed, 105 remain
>>         tuples: 3147 removed, 40 remain
>>         system usage: CPU 0.11s/0.09u sec elapsed 6.02 sec
>>     
>
> it seems like a serious omission that this gives you no hint how many
> pages were scanned.
>
>             regards, tom lane
>   


Re: LDC - Load Distributed Checkpoints with PG8.3b2 on Solaris

From
Heikki Linnakangas
Date:
Jignesh K. Shah wrote:
> Since it's really writes that I am having trouble with: the autovacuum 
> message tells me 11 pages were removed and so many tuples were 
> removed, so I am guessing it's writes.

Do you keep track of I/O to WAL and data separately? WAL bandwidth will 
spike up when a checkpoint starts, because of full page writes.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: LDC - Load Distributed Checkpoints with PG8.3b2 on Solaris

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Looking at the autovacuum log output, 
> 
>> 2007-11-13 09:21:19.830 PST 9458  LOG:  automatic vacuum of table 
>> "specdb.public.txn_log_table": index scans: 1
>>         pages: 11 removed, 105 remain
>>         tuples: 3147 removed, 40 remain
>>         system usage: CPU 0.11s/0.09u sec elapsed 6.02 sec
> 
> it seems like a serious omission that this gives you no hint how many
> pages were scanned.

Isn't it pages removed + remain? 116 in this case.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: LDC - Load Distributed Checkpoints with PG8.3b2 on Solaris

From
Alvaro Herrera
Date:
Tom Lane wrote:

> > 2007-11-13 09:21:19.830 PST 9458  LOG:  automatic vacuum of table 
> > "specdb.public.txn_log_table": index scans: 1
> >         pages: 11 removed, 105 remain
> >         tuples: 3147 removed, 40 remain
> >         system usage: CPU 0.11s/0.09u sec elapsed 6.02 sec
> 
> it seems like a serious omission that this gives you no hint how many
> pages were scanned.

Hmm, right.  I'm not sure how to fix it; the simplest idea is to count
the number of heap page accesses in lazy_scan_heap, but this wouldn't
count index pages so it wouldn't be real.  (However, we already report
"index scans" so maybe this is not all that important).

Another, more complex idea would be to use the already existing
infrastructure for counting buffer accesses, as in ShowBufferUsage.
However, just calling ResetBufferUsage and then getting the counts would
make the counters useless for the outer reporting (the callers in
postgres.c).  We could have a separate set of "save" counters: when
vacuum starts, save the current counters and reset them; do the vacuum,
report the counters; and finally, restore the saved counters by adding
the current counters back.

Is this too complex?
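
In code terms, the idea might look something like this (a rough sketch with 
hypothetical counter names, not the real ShowBufferUsage variables):

    /* Stand-ins for the real buffer-usage counters (hypothetical names). */
    static long buf_reads;
    static long buf_hits;

    static void
    vacuum_with_own_buffer_counts(void (*do_vacuum)(void))
    {
        /* 1. Save the running totals and reset, so vacuum counts from zero. */
        long saved_reads = buf_reads;
        long saved_hits  = buf_hits;
        buf_reads = 0;
        buf_hits  = 0;

        /* 2. Run the vacuum; buffer accesses bump the counters as usual. */
        do_vacuum();

        /* 3. Report the vacuum-only counts in the autovacuum log line here. */

        /* 4. Restore: add the saved totals back so the outer reporting
         *    (the callers in postgres.c) still sees correct cumulative numbers. */
        buf_reads += saved_reads;
        buf_hits  += saved_hits;
    }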

-- 
Alvaro Herrera                 http://www.amazon.com/gp/registry/CTMLCN8V17R4
"La tristeza es un muro entre dos jardines" (Khalil Gibran)


Re: LDC - Load Distributed Checkpoints with PG8.3b2 on Solaris

From
"Jignesh K. Shah"
Date:
Yes, I separate them out as follows:
PGDATA + 1 TABLE which needs to be cached (also works around the CLOG read 
problem)
LOGS
DATABASE TABLES
DATABASE INDEX
to get a good view of the IOs going out.

I have full_page_writes=off in my settings.
I don't see spikes of increased WAL during checkpoints (maybe due to that 
setting), just a constant load in the range of about 2-2.5 MB/sec, which is 
not low, but my load is high.

In my current run I do have async WAL (asynchronous commit) enabled, with 
wal_writer_delay=100ms and commit_delay off.


-Jignesh

Heikki Linnakangas wrote:
> Jignesh K. Shah wrote:
>> Since it's really writes that I am having trouble with: the autovacuum 
>> message tells me 11 pages were removed and so many tuples were 
>> removed, so I am guessing it's writes.
>
> Do you keep track of I/O to WAL and data separately? WAL bandwidth 
> will spike up when a checkpoint starts, because of full page writes.
>


Re: LDC - Load Distributed Checkpoints with PG8.3b2 on Solaris

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Tom Lane wrote:
>> it seems like a serious omission that this gives you no hint how many
>> pages were scanned.

> Hmm, right.  I'm not sure how to fix it; the simplest idea is to count
> the number of heap page accesses in lazy_scan_heap, but this wouldn't
> count index pages so it wouldn't be real.  (However, we already report
> "index scans" so maybe this is not all that important).

> Another, more complex idea would be to use the already existing
> infrastructure for counting buffer accesses, as in ShowBufferUsage.
> However, just calling ResetBufferUsage and then getting the counts would
> make the counters useless for the outer reporting (the callers in
> postgres.c).  We could have a separate set of "save" counters: when
> vacuum starts, save the current counters and reset them; do the vacuum,
> report the counters; and finally, restore the saved counters by adding
> the current counters back.

> Is this too complex?

Too complex for my taste, anyway.  I would be satisfied if the log
entries just indicated how big the table and indexes were.  Heikki
pointed out that the message does tell you how many heap pages there
were, if you know to add removed + remain; but unless I'm still missing
something there's no way to tell how many index pages.
        regards, tom lane


Re: LDC - Load Distributed Checkpoints with PG8.3b2 on Solaris

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > Tom Lane wrote:
> >> it seems like a serious omission that this gives you no hint how many
> >> pages were scanned.
> 
> Too complex for my taste, anyway.  I would be satisfied if the log
> entries just indicated how big the table and indexes were.  Heikki
> pointed out that the message does tell you how many heap pages there
> were, if you know to add removed + remain; but unless I'm still missing
> something there's no way to tell how many index pages.

Maybe we can save the RelationGetNumberOfBlocks() results at the start of
vacuum for the heap and indexes, and report them as two numbers (total heap
pages: X, total index pages: Y).

We would still be missing the TOAST rel and index ...
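
Something along these lines (a sketch with stubbed-out types; in the backend 
the sizes would come from RelationGetNumberOfBlocks() on the real Relation):

    typedef unsigned int BlockNumber;

    /* Stand-in for a relation; only the size matters for this sketch. */
    typedef struct
    {
        BlockNumber nblocks;
    } RelStub;

    static void
    capture_vacuum_sizes(const RelStub *heap,
                         const RelStub *const *indexes, int nindexes,
                         BlockNumber *total_heap_pages,
                         BlockNumber *total_index_pages)
    {
        /* Snapshot the sizes once, when the vacuum starts. */
        *total_heap_pages = heap->nblocks;

        *total_index_pages = 0;
        for (int i = 0; i < nindexes; i++)
            *total_index_pages += indexes[i]->nblocks;

        /* These two totals would then be appended to the autovacuum log
         * line, e.g. "total heap pages: X, total index pages: Y". */
    }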

-- 
Alvaro Herrera                          Developer, http://www.PostgreSQL.org/
"Cuando miro a alguien, más me atrae cómo cambia que quién es" (J. Binoche)


Re: LDC - Load Distributed Checkpoints with PG8.3b2 on Solaris

From
Gregory Stark
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

> Tom Lane wrote:
>> Looking at the autovacuum log output, 
>>
>>> 2007-11-13 09:21:19.830 PST 9458  LOG:  automatic vacuum of table
>>> "specdb.public.txn_log_table": index scans: 1
>>>         pages: 11 removed, 105 remain
>>>         tuples: 3147 removed, 40 remain
>>>         system usage: CPU 0.11s/0.09u sec elapsed 6.02 sec
>>
>> it seems like a serious omission that this gives you no hint how many
>> pages were scanned.
>
> Isn't it pages removed + remain? 116 in this case.

How do 40 tuples take 105 pages?

--  Gregory Stark EnterpriseDB  http://www.enterprisedb.com  Ask me about EnterpriseDB's Slony Replication support!


Re: LDC - Load Distributed Checkpoints with PG8.3b2 on Solaris

From
Alvaro Herrera
Date:
Gregory Stark wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> 
> > Tom Lane wrote:
> >> Looking at the autovacuum log output, 
> >>
> >>> 2007-11-13 09:21:19.830 PST 9458  LOG:  automatic vacuum of table
> >>> "specdb.public.txn_log_table": index scans: 1
> >>>         pages: 11 removed, 105 remain
> >>>         tuples: 3147 removed, 40 remain
> >>>         system usage: CPU 0.11s/0.09u sec elapsed 6.02 sec
> >>
> >> it seems like a serious omission that this gives you no hint how many
> >> pages were scanned.
> >
> > Isn't it pages removed + remain? 116 in this case.
> 
> How do 40 tuples take 105 pages?

The first 65 pages could be empty.

-- 
Alvaro Herrera                               http://www.PlanetPostgreSQL.org/
Bob [Floyd] used to say that he was planning to get a Ph.D. by the "green
stamp method," namely by saving envelopes addressed to him as 'Dr. Floyd'.
After collecting 500 such letters, he mused, a university somewhere in
Arizona would probably grant him a degree.              (Don Knuth)