Thread: [postgresql 9.3.5] autovacuums stuck on non-existent pg_toast tables

[postgresql 9.3.5] autovacuums stuck on non-existent pg_toast tables

From
tpham
Date:
Hi everyone,

Two weeks ago, one of our Postgres databases crashed violently and had to be
brought up again. This certainly resulted in some lost pg_clog files, and we
had to zero-fill them in one by one to get autovacuum up and running again.

Now, we have two autovacuuming processes constantly stuck at two pg_toast
tables that are nonexistent:

autovacuum: VACUUM pg_toast.pg_toast_455742374 (runs up to twelve hours with
no progress)

# select 455742374::regclass;
 regclass
-----------
 455742374
(1 row)

whereas normally for a pg_toast table we should have something like

# select 1371237940::regclass;
                    regclass
-------------------------------------------------
 performance_report_parallel.campaigns_summary_p
(1 row)

The result is that we are down to just one autovacuum worker, which is
evidently not enough to help our database oiled and running. :"(

Our autovacuum paramters are all using default values.

Any help would be greatly appreciated. Thank you!



--
View this message in context:
http://postgresql.nabble.com/postgresql-9-3-5-autovacuums-stuck-on-non-existent-pg-toast-tables-tp5839397.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: [postgresql 9.3.5] autovacuums stuck on non-existent pg_toast tables

From
Alvaro Herrera
Date:
tpham wrote:
> Hi everyone,
>
> Two weeks ago, one of our Postgres databases crashed violently and had to be
> brought up again. This certainly resulted in some lost pg_clog files, and we
> had to zero-fill them in one by one to get autovacuum up and running again.

You should never lose pg_clog files to any type of crash.  This kind of
problem shows up when you have broken config somehow, perhaps running
with fsync=off or your disks have write cache enabled and no
battery-backup for it.  This seems like a serious issue that should be
investigated more closely.

Your filesystem might have put the lost files in lost+found.

> Now, we have two autovacuuming processes constantly stuck at two pg_toast
> tables that are nonexistent:
>
> autovacuum: VACUUM pg_toast.pg_toast_455742374 (runs up to twelve hours with
> no progress)
>
> # select 455742374::regclass;
>  regclass
> -----------
>  455742374
> (1 row)

So, you could look at the toast table directly in pg_class, and perhaps
delete the pg_class entry for the stale pg_toast table and the file if
there's any.  For instance, try

select relfilenode from pg_class where relname = 'pg_toast_455742374'

which would give you the file name of the offending toast table so that
you can remove it in the filesystem.  You can then run a DELETE against
pg_class.

Another (probably better) way to look for the table would be something
like

select *
  from pg_class
 where reltoastrelid = (select oid
                          from pg_class
                         where relname = 'pg_toast_455742374');
or similar.

I think you should pg_dump this database and restore it in a newly
initdb'd directory.  Otherwise, who knows what other inconsistency you
might have in your data.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [postgresql 9.3.5] autovacuums stuck on non-existent pg_toast tables

From
Alvaro Herrera
Date:
Tong Pham wrote:

> We do have fsync turned on, and there was no disk failure. The database
> had to be shut down forcefully because it was becoming nonresponsive
> (probably due to inadequate earlier vacuuming) and we could not get the
> remaining queries to terminate with normal cancel/terminate operations (so
> a SIGKILL was used ‹ which was bad.)

Yes, SIGKILL is a bad idea, but it shouldn't lead to lost pg_clog files
nonetheless.  Is there a network filesystem involved perhaps?

> I did look inside pg_class, but there is absolutely no mention of this
> pg_toast table.

Well, so you can DELETE FROM pg_class the broken entry and you will get
rid of the autovacuum problem.  (Removing from the filesystem would be
good as well if the file is large, to save disk space.)

> This database has a streaming replica that has been
> healthy through the previous crash. If we decide to switch over to this
> replica, do you think these kinds of inconsistencies would be on the
> replica as well?

No idea there.  Depending on how the files disappeared, maybe the
replica has the same problem.

> A pg_dumpall is possible but might take too much time.

There's a nice saying, "I can make this arbitrarily fast if it doesn't
have to give the right answer".

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [postgresql 9.3.5] autovacuums stuck on non-existent pg_toast tables

From
Erik Jones
Date:
> On Feb 25, 2015, at 3:27 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> Tong Pham wrote:
>
>> We do have fsync turned on, and there was no disk failure. The database
>> had to be shut down forcefully because it was becoming nonresponsive
>> (probably due to inadequate earlier vacuuming) and we could not get the
>> remaining queries to terminate with normal cancel/terminate operations (so
>> a SIGKILL was used ‹ which was bad.)

FWIW, when the server isn’t responding to pg_cancel_backend()/pt_terminate_backend() processes will typically go away
ifyou directly issue a SIGTERM from the system, e.g. `kill <pid>` from bash or whatever. 



Re: [postgresql 9.3.5] autovacuums stuck on non-existent pg_toast tables

From
Tong Pham
Date:
Hi Alvaro,

Thank you for your quick response!

We do have fsync turned on, and there was no disk failure. The database
had to be shut down forcefully because it was becoming nonresponsive
(probably due to inadequate earlier vacuuming) and we could not get the
remaining queries to terminate with normal cancel/terminate operations (so
a SIGKILL was used ‹ which was bad.)

I did look inside pg_class, but there is absolutely no mention of this
pg_toast table. This database has a streaming replica that has been
healthy through the previous crash. If we decide to switch over to this
replica, do you think these kinds of inconsistencies would be on the
replica as well? A pg_dumpall is possible but might take too much time.

Thanks!

Regards,
Tong

On 2/25/15, 11:42 AM, "Alvaro Herrera" <alvherre@2ndquadrant.com> wrote:

>tpham wrote:
>> Hi everyone,
>>
>> Two weeks ago, one of our Postgres databases crashed violently and had
>>to be
>> brought up again. This certainly resulted in some lost pg_clog files,
>>and we
>> had to zero-fill them in one by one to get autovacuum up and running
>>again.
>
>You should never lose pg_clog files to any type of crash.  This kind of
>problem shows up when you have broken config somehow, perhaps running
>with fsync=off or your disks have write cache enabled and no
>battery-backup for it.  This seems like a serious issue that should be
>investigated more closely.
>
>Your filesystem might have put the lost files in lost+found.
>
>> Now, we have two autovacuuming processes constantly stuck at two
>>pg_toast
>> tables that are nonexistent:
>>
>> autovacuum: VACUUM pg_toast.pg_toast_455742374 (runs up to twelve hours
>>with
>> no progress)
>>
>> # select 455742374::regclass;
>>  regclass
>> -----------
>>  455742374
>> (1 row)
>
>So, you could look at the toast table directly in pg_class, and perhaps
>delete the pg_class entry for the stale pg_toast table and the file if
>there's any.  For instance, try
>
>select relfilenode from pg_class where relname = 'pg_toast_455742374'
>
>which would give you the file name of the offending toast table so that
>you can remove it in the filesystem.  You can then run a DELETE against
>pg_class.
>
>Another (probably better) way to look for the table would be something
>like
>
>select *
>  from pg_class
> where reltoastrelid = (select oid
>                          from pg_class
>                         where relname = 'pg_toast_455742374');
>or similar.
>
>I think you should pg_dump this database and restore it in a newly
>initdb'd directory.  Otherwise, who knows what other inconsistency you
>might have in your data.
>
>--
>Álvaro Herrera                http://www.2ndQuadrant.com/
>PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Hi Erik,

I thouht pg_cancel_backend and pg_terminate_backend simply send SIGINT and
SIGTERM, respectively, to the backend process? We did try using the commands
and sending the signals ourselves, but to no avail.



--
View this message in context:
http://postgresql.nabble.com/postgresql-9-3-5-autovacuums-stuck-on-non-existent-pg-toast-tables-tp5839397p5839896.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Hi Álvaro,

Pardon my stupidity. I kept looking for things in the wrong place. The
pg_toast tables have always been there, I was just issuing the commands yo
mentioned in a different database. Need to get more sleep. xD

In the end, we decided the entire database needed a good cleaning. So we
scheduled a four-hour downtime, during which we restarted the database on a
different port (so that nothing would connect to it), increased the number
of autovacuum processes from three to five, and used vaccumdb to vacuum
analyze the top four dabatases in terms of usage.

We thought the pg_toast tables were still somehow corrupted, but a pg_dump
showed that we could dump it just fine. So we let them be. The next day,
performance was back to normal, and autovacuum was able to actually finish
working on these pg_toast tables by themselves. :")

Thank you so much for your help! I have questions about monitoring
autovacuum monitoring, but I'll create another thread for that!

Regards,
Tong



--
View this message in context:
http://postgresql.nabble.com/postgresql-9-3-5-autovacuums-stuck-on-non-existent-pg-toast-tables-tp5839397p5839897.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.