Thread: freeze cannot be finished

freeze cannot be finished

From: Миша Тюрин

Hello!
We are experiencing a suspicious and very painful case in our top business-critical database. We have only 7 weeks before an emergency stop of the cluster, because the anti-wraparound autovacuum task fails to finish, again and again.

PostgreSQL 9.2.4 on x86_64-unknown-linux-gnu, compiled by gcc-4.4.real (Debian 4.4.5-8) 4.4.5, 64-bit
write transaction rate: ~100-200/s (~100,000,000 per week)

autovacuum_freeze_max_age  1,000,000,000
vacuum_freeze_min_age        300,000,000
vacuum_freeze_table_age      900,000,000

vacuum_cost_delay 40 (autovacuum_vacuum_cost_delay -1)
vacuum_cost_limit 700
vacuum_cost_page_dirty 1
vacuum_cost_page_hit 1
vacuum_cost_page_miss 10

Freeze autovacuum ("autovacuum: VACUUM ANALYZE public.items (to prevent wraparound)") runs 4 days
! but it does not update pg_class.relfrozenxid

:((

And the freeze autovacuum keeps starting again and again (we have seen it at least 5 times already).
There are no long-running transactions in our cluster.
We have also done a manual VACUUM FREEZE on the "items" table; with vacuum_cost_delay 10 it took 1 day and, unfortunately, it did nothing to pg_class.relfrozenxid.
And we saw (from strace on the postgres backend doing the VACUUM FREEZE) that vacuum reads and writes a lot in our table's files.
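
For reference, open and prepared transactions that could hold back the freeze horizon can be checked from the standard catalogs; a sketch of that kind of check:

-- Oldest open transactions (pg_stat_activity in 9.2 has pid/state/xact_start):
SELECT pid, state, xact_start, now() - xact_start AS xact_age
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;

-- Forgotten two-phase transactions also hold back the xmin horizon:
SELECT gid, prepared FROM pg_prepared_xacts ORDER BY prepared;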

It looks like vacuum cannot update pg_class at the end (maybe).
How can we get that vacuum freeze to finish?!
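
For anyone watching the countdown: the remaining headroom before the forced shutdown can be tracked with age() on the system catalogs. A sketch, using standard queries only (the cluster stops accepting new commands when the oldest age gets close to roughly two billion):

-- Distance to XID wraparound, per database and for the worst tables:
SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC;

SELECT relname, age(relfrozenxid)
FROM pg_class
WHERE relkind = 'r'
ORDER BY 2 DESC
LIMIT 10;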
Thanks in advance!



Re: freeze cannot be finished

From: Tom Lane
Миша Тюрин <tmihail@bk.ru> writes:
> We are experiencing a suspicious and very painful case in our top business-critical database. We have only 7 weeks
> before an emergency stop of the cluster, because the anti-wraparound autovacuum task fails to finish, again and again.

I'm betting most of the pain comes from this:

> vacuum_cost_delay 40 (autovacuum_vacuum_cost_delay -1)
> vacuum_cost_limit 700

You've set these parameters to values that will make vacuum crawl.
It's not terribly surprising that it's failing to keep up with what
needs to be done.
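
For a rough sense of the ceiling those settings impose (a back-of-the-envelope sketch, assuming the worst case where every page read is a buffer miss and ignoring the work itself):

-- cost_limit / cost_page_miss pages are read between each cost_delay nap:
SELECT 700 / 10                                    AS pages_per_nap,         -- 70
       round((700 / 10) / 0.040)                   AS pages_per_second,      -- 1750
       round((700 / 10) / 0.040 * 8192 / 1048576)  AS approx_mib_per_second; -- ~14

That is on the order of 14 MiB/s of heap reads at 8 kB per page, before the index scans are even considered.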

I'd try setting vacuum_cost_delay to zero again until you're more
or less caught up.  After that you could perhaps adopt a value in
the range of a millisecond or two.  Not 40.
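
Concretely, a manual catch-up pass with the cost-based delay disabled might look like this (a sketch; session-level settings only, table name taken from the thread):

SET vacuum_cost_delay = 0;
VACUUM (FREEZE, VERBOSE) public.items;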

            regards, tom lane


Re[2]: [GENERAL] freeze cannot be finished

From: Миша Тюрин

Hello, Tom! Thank you for such a fast answer.
We did a manual VACUUM FREEZE with delay 10 as well, but unfortunately that vacuum did nothing to pg_class.relfrozenxid either.

Vacuum delay values of 10-40 have worked very well for many years in our production cases.

> It's not terribly surprising that it's failing to keep up with what
> needs to be done.
why?

Thanks in advance!

Re: freeze cannot be finished

From: Jeff Janes
On Wed, Nov 13, 2013 at 7:29 AM, Миша Тюрин <tmihail@bk.ru> wrote:

> Hello!
> We are experiencing a suspicious and very painful case in our top business-critical database. We have only 7 weeks before an emergency stop of the cluster, because the anti-wraparound autovacuum task fails to finish, again and again.
>
> PostgreSQL 9.2.4 on x86_64-unknown-linux-gnu, compiled by gcc-4.4.real (Debian 4.4.5-8) 4.4.5, 64-bit
> write transaction rate: ~100-200/s (~100,000,000 per week)

There were changes to vacuum's locking that were introduced in 9.2.4 which had some unintended consequences, which were then fixed in 9.2.5.  I didn't think that those would affect wraparound vacuums, but maybe they did.

When did you upgrade to 9.2.4?  Can you go to 9.2.5?
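
To double-check which minor release is actually running (a trivial check):

SHOW server_version;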

 

> autovacuum_freeze_max_age  1,000,000,000
> vacuum_freeze_min_age        300,000,000
> vacuum_freeze_table_age      900,000,000
>
> vacuum_cost_delay 40 (autovacuum_vacuum_cost_delay -1)
> vacuum_cost_limit 700
> vacuum_cost_page_dirty 1
> vacuum_cost_page_hit 1
> vacuum_cost_page_miss 10


The ratio between miss and dirty seems unusual to me.
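
For comparison, the shipped defaults can be read straight from pg_settings (boot_val); a quick check along these lines:

-- The stock weights are hit = 1, miss = 10, dirty = 20:
SELECT name, setting, boot_val
FROM pg_settings
WHERE name LIKE 'vacuum_cost_page%';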
 

> The freeze autovacuum ("autovacuum: VACUUM ANALYZE public.items (to prevent wraparound)") runs for 4 days,
> but it does not update pg_class.relfrozenxid!

Does the server log show anything about vacuuming?  If it were aborting without updating, I would think it would leave a message of some kind (even if a misleading one).
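
If autovacuum is not being logged at all, one option (a suggestion; the setting normally lives in postgresql.conf) is to lower log_autovacuum_min_duration so every run leaves a log line:

-- -1 (the 9.2 default) disables this logging; 0 logs every autovacuum action.
SHOW log_autovacuum_min_duration;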

Cheers,

Jeff 

Re: freeze cannot be finished

From: Sergey Burladyan
Jeff Janes <jeff.janes@gmail.com> writes:

If I'm not mistaken, it looks like lazy_scan_heap(), called from lazy_vacuum_rel()
(see [1]), skips pages even when it runs with scan_all == true: lazy_scan_heap()
does not increment scanned_pages if lazy_check_needs_freeze() returns false, so
if this happens during an anti-wraparound vacuum it cannot update pg_class, because
pg_class is updated via this code:

    new_frozen_xid = FreezeLimit;
    if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
        new_frozen_xid = InvalidTransactionId;

    vac_update_relstats(onerel,
                        new_rel_pages,
                        new_rel_tuples,
                        new_rel_allvisible,
                        vacrelstats->hasindex,
                        new_frozen_xid);

So I think that in our anti-wraparound vacuum, vacrelstats->scanned_pages is always
less than vacrelstats->rel_pages, and pg_class.relfrozenxid is never updated.

[1] src/backend/commands/vacuumlazy.c

--
Sergey Burladyan


Re: freeze cannot be finished

From: Jeff Janes
On Wed, Nov 13, 2013 at 3:53 PM, Sergey Burladyan <eshkinkot@gmail.com> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>
> If I'm not mistaken, it looks like lazy_scan_heap(), called from lazy_vacuum_rel()
> (see [1]), skips pages even when it runs with scan_all == true: lazy_scan_heap()
> does not increment scanned_pages if lazy_check_needs_freeze() returns false, so
> if this happens during an anti-wraparound vacuum it cannot update pg_class, because
> pg_class is updated via this code:
>
>     new_frozen_xid = FreezeLimit;
>     if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
>         new_frozen_xid = InvalidTransactionId;
>
>     vac_update_relstats(onerel,
>                         new_rel_pages,
>                         new_rel_tuples,
>                         new_rel_allvisible,
>                         vacrelstats->hasindex,
>                         new_frozen_xid);
>
> So I think that in our anti-wraparound vacuum, vacrelstats->scanned_pages is always
> less than vacrelstats->rel_pages, and pg_class.relfrozenxid is never updated.

Yeah, I think that that is a bug.  If the clean-up lock is unavailable but the page is inspected without it and found not to need freezing, then the page needs to be counted as scanned, but is not so counted.
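
On an affected table, the symptom should show up as vacuum timestamps advancing while relfrozenxid stays put; something along these lines (table name taken from the thread):

SELECT c.relfrozenxid, age(c.relfrozenxid), s.last_vacuum, s.last_autovacuum
FROM pg_class c
JOIN pg_stat_user_tables s ON s.relid = c.oid
WHERE c.oid = 'public.items'::regclass;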

commit bbb6e559c4ea0fb4c346beda76736451dc24eb4e
Date:   Mon Nov 7 21:39:40 2011 -0500

But this was introduced in 9.2.0, so unless the OP didn't upgrade to 9.2 until recently, I don't know why it just started happening.

It looks like a simple fix (to HEAD attached), but I don't know how to test it.

Also, it seems like it might be worth issuing a warning if scan_all is true but all was not scanned.

Cheers,

Jeff
Attachment

Re: freeze cannot be finished

From: Heikki Linnakangas
On 14.11.2013 02:26, Jeff Janes wrote:
> On Wed, Nov 13, 2013 at 3:53 PM, Sergey Burladyan <eshkinkot@gmail.com> wrote:
>
>> Jeff Janes <jeff.janes@gmail.com> writes:
>>
>> If I'm not mistaken, it looks like lazy_scan_heap(), called from
>> lazy_vacuum_rel() (see [1]), skips pages even when it runs with
>> scan_all == true: lazy_scan_heap() does not increment scanned_pages
>> if lazy_check_needs_freeze() returns false, so if this happens during
>> an anti-wraparound vacuum it cannot update pg_class, because pg_class
>> is updated via this code:
>>
>>      new_frozen_xid = FreezeLimit;
>>      if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
>>          new_frozen_xid = InvalidTransactionId;
>>
>>      vac_update_relstats(onerel,
>>                          new_rel_pages,
>>                          new_rel_tuples,
>>                          new_rel_allvisible,
>>                          vacrelstats->hasindex,
>>                          new_frozen_xid);
>>
>> So I think that in our anti-wraparound vacuum, vacrelstats->scanned_pages
>> is always less than vacrelstats->rel_pages, and pg_class.relfrozenxid is
>> never updated.
>
> Yeah, I think that that is a bug.  If the clean-up lock is unavailable but
> the page is inspected without it and found not to need freezing, then the
> page needs to be counted as scanned, but is not so counted.
>
> commit bbb6e559c4ea0fb4c346beda76736451dc24eb4e
> Date:   Mon Nov 7 21:39:40 2011 -0500
>
> But this was introduced in 9.2.0, so unless the OP didn't upgrade to 9.2
> until recently, I don't know why it just started happening.
>
> It looks like a simple fix (to HEAD attached), but I don't know how to
> test it.

Thanks, committed.

I was able to reproduce it by doing this:

-- Create and populate a test table
create table foo (i int4);
insert into foo select generate_series(1, 10000);

-- Freeze it, and observe relfrozenxid.
vacuum freeze foo;
select relfrozenxid from pg_class where oid='foo'::regclass;

-- Do some transactions to bump current transaction ID
insert into foo values (-1);
insert into foo values (-1);
insert into foo values (-1);
insert into foo values (-1);

-- Now, in a second psql session, open a cursor on the table. It keeps
-- the current page pinned, which causes the ConditionalLockBuffer() in
-- vacuumlazy.c to fail to grab the lock:

begin; declare foocur cursor for select * from foo; fetch foocur;

-- Back to the original psql session, vacuum freeze again:
vacuum freeze foo;

-- Observe the new relfrozenxid. With the bug, it's the same as before.
-- Vacuum freeze is not able to advance relfrozenxid because it skipped the
-- page pinned by the cursor. With the patch, it does advance.

select relfrozenxid from pg_class where oid='foo'::regclass;

> Also, it seems like it might be worth issuing a warning if scan_all is true
> but all was not scanned.

Hmm, the new definition of "scanned", with this patch, is that pages
that were not vacuumed are still counted as scanned. I don't think a
warning is appropriate; there isn't anything wrong with skipping pinned
pages that don't need freezing, and the amount of bloat left behind should
be tiny. For diagnostic purposes, perhaps they should be counted
separately in the vacuum verbose output, though.

- Heikki