Thread: serious problems with vacuuming databases

serious problems with vacuuming databases

From: Tomas Vondra

Hello,

we have some performance problems with postgres 8.0.4, more precisely
with vacuuming a 'large' database with a lot of deleted rows.

We had a 3.2 GB database, consisting mainly of 4 large tables: two of
them (say tables A and B) have about 14,000,000 rows and about 1 GB of
data each, and the other two (say C and D) have about 4,000,000 rows and
500 MB each. The rest of the database is not important.

We've decided to remove unneeded 'old' data, which means removing about
99.999% of rows from tables A, C and D (about 2 GB of data). At the
beginning, the B table (containing data aggregated from A, C and D) was
emptied (dropped and created) and filled in with current data. Then,
before the deletion, the data from tables A, C and D were backed up using
other tables (say A_old, C_old, D_old), filled in using

   INSERT INTO A_old SELECT * FROM A ...

and fixed so there are no duplicates (rows both in A and A_old). Then
these data were deleted from A, C and D, and tables A_old, C_old and D_old
were dumped, truncated and all the tables were vacuumed (with FULL
ANALYZE options). So the procedure was this:

1) drop, create and fill table B (aggregated data from A, C, D)
2) copy 'old' data from A, C and D to A_old, C_old and D_old
3) delete old data from A, C, D
4) dump data from A_old, C_old and D_old
5) truncate tables A, C, D
6) vacuum full analyze tables A, C, D, A_old, C_old and D_old
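
To give an idea, steps 2, 3 and 6 for table A looked roughly like this
(the 'stamp' column and the cutoff date are only illustrative here):

   -- step 2: back up the old rows
   INSERT INTO A_old SELECT * FROM A WHERE stamp < '2005-12-28';
   -- step 3: remove them from the live table
   DELETE FROM A WHERE stamp < '2005-12-28';
   -- step 6: reclaim the space and refresh the statistics
   VACUUM FULL ANALYZE A;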

So the dump of the database has about 1.2 GB of data, of which about
1 GB is in the B table (the one rebuilt in step 1). This was done yesterday.

The problem is this - today we ran a scheduled VACUUM FULL ANALYZE for
the whole database, and it has been running for about 10 hours already,
which is much longer than usual (and it is still running).

The hardware is not too bad - it's a Dell server with 2 x 3.0 GHz P4 HT
CPUs, 4 GB of RAM, 2 x 15k SCSI drives in hardware RAID, etc.

The question is why this happens and how to get around it. I guess it's
caused by the huge amount of data deleted yesterday, but on the other hand
all the modified tables were vacuumed at the end. Still, I guess dropping
and reloading the whole database would be much faster (at most 1.5 hours
including creating indexes etc.).

thanks for your advice
Tomas

Re: serious problems with vacuuming databases

From: Alvaro Herrera

Tomas Vondra wrote:

Hi,

> Then
> these data were deleted from A, C, D and tables A_old, C_old and D_old
> were dumped, truncated and all the tables were vacuumed (with FULL
> ANALYZE options). So the procedure was this
>
> 1) drop, create and fill table B (aggregated data from A, C, D)
> 2) copy 'old' data from A, C and D to A_old, C_old and D_old
> 3) delete old data from A, C, D
> 4) dump data from A_old, C_old and D_old
> 5) truncate tables A, C, D
> 6) vacuum full analyze tables A, C, D, A_old, C_old and D_old
>
> So the dump of the database has about 1.2 GB of data, of which about
> 1 GB is in the B table (the one rebuilt in step 1). This was done yesterday.
>
> The problem is this - today we ran a scheduled VACUUM FULL ANALYZE for
> the whole database, and it has been running for about 10 hours already,
> which is much longer than usual (and it is still running).

Probably the indexes are bloated after the vacuum full.  I think the
best way to get rid of the "fat" is to recreate both tables and indexes
anew.  For this the best tool would be to CLUSTER the tables on some
index, probably the primary key.  This will be much faster than
VACUUMing the tables, and the indexes will be much smaller as a result.
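
For example, assuming table "a" has a primary key index named a_pkey,
something like this:

   CLUSTER a_pkey ON a;   -- rewrites the table in index order and rebuilds all its indexes
   ANALYZE a;             -- CLUSTER does not refresh the planner statistics

would compact both the table and its indexes in one pass.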

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: serious problems with vacuuming databases

From: Tom Lane

Tomas Vondra <tv@fuzzy.cz> writes:
> 1) drop, create and fill table B (aggregated data from A, C, D)
> 2) copy 'old' data from A, C and D to A_old, C_old and D_old
> 3) delete old data from A, C, D
> 4) dump data from A_old, C_old and D_old
> 5) truncate tables A, C, D
> 6) vacuum full analyze tables A, C, D, A_old, C_old and D_old

Steps 3/5/6 make no sense at all to me: why bother deleting data retail
when you are about to truncate the tables, and why bother vacuuming a
table you just truncated?  Is the above *really* what you did?

> The problem is this - today we ran a scheduled VACUUM FULL ANALYZE for
> the whole database, and it has been running for about 10 hours already,
> which is much longer than usual (and it is still running).

Is it actually grinding the disk, or is it just blocked waiting for
someone's lock?  If it's actually doing work, which table is it working
on?  (You should be able to figure that out by looking in pg_locks,
or by strace'ing the process to see which files it's touching.)
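
For instance, a query along these lines shows which relation each backend
holds or is waiting for locks on (an ungranted lock would mean the VACUUM
is blocked rather than working):

   SELECT c.relname, l.pid, l.mode, l.granted
     FROM pg_locks l
     JOIN pg_class c ON c.oid = l.relation
    ORDER BY l.pid;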

            regards, tom lane

Re: serious problems with vacuuming databases

From: Tomas Vondra

Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> 1) drop, create and fill table B (aggregated data from A, C, D)
>> 2) copy 'old' data from A, C and D to A_old, C_old and D_old
>> 3) delete old data from A, C, D
>> 4) dump data from A_old, C_old and D_old
>> 5) truncate tables A, C, D
>> 6) vacuum full analyze tables A, C, D, A_old, C_old and D_old
>
> Steps 3/5/6 make no sense at all to me: why bother deleting data retail
> when you are about to truncate the tables, and why bother vacuuming a
> table you just truncated?  Is the above *really* what you did?

Yes, the above is exactly what I did, with the exception that there's an
error in step (5) - it should be truncation of the _old tables.
The reasons that led me to these particular steps are two:

(a) I don't want to delete all the data, just data older than two days.
    Until today we've kept all the data (two years of access logs for
    one of our production websites), but now we've decided to remove
    the data we don't need and keep just the aggregated version. That's
    why I used DELETE rather than TRUNCATE.

(b) I want to create 'incremental' backups, so when I need the data
    I can take several packages (dumps of the _old tables) and import
    them one after another. Plain pg_dump doesn't allow me this - dumping
    the whole tables A, C and D is not an option, because I want to leave
    some of the data in the tables.

    From now on, the tables will be cleared on a daily (or maybe weekly)
    basis, which means a much smaller amount of data (about 50,000 rows
    a day).
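
    So the daily job should look roughly like this (again, the 'stamp'
    column and the two-day interval are just an illustration):

       INSERT INTO A_old SELECT * FROM A WHERE stamp < now() - interval '2 days';
       DELETE FROM A WHERE stamp < now() - interval '2 days';
       -- dump only the backup table as one incremental package, e.g. with
       --   pg_dump -t a_old the_database > a_old_backup.sql
       TRUNCATE A_old;
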
>
>> The problem is this - today we ran a scheduled VACUUM FULL ANALYZE for
>> the whole database, and it has been running for about 10 hours already,
>> which is much longer than usual (and it is still running).
>
> Is it actually grinding the disk, or is it just blocked waiting for
> someone's lock?  If it's actually doing work, which table is it working
> on?  (You should be able to figure that out by looking in pg_locks,
> or by strace'ing the process to see which files it's touching.)

Thanks for the hint, I'll try to figure that out in case the dump/reload
recommended by Alvaro Herrera doesn't help. But as far as I know the
disks are not grinding right now, so I guess it's a problem with the indexes.

t.v.

Re: serious problems with vacuuming databases

From: Tomas Vondra

> Probably the indexes are bloated after the vacuum full.  I think the
> best way to get rid of the "fat" is to recreate both tables and indexes
> anew.  For this the best tool would be to CLUSTER the tables on some
> index, probably the primary key.  This will be much faster than
> VACUUMing the tables, and the indexes will be much smaller as a result.

I guess you're right. I forgot to mention there are 12 composite indexes
on the largest (and not deleted) table B, which has about 14,000,000 rows
and 1 GB of data. I'll try to dump/reload the database ...

t.v.

Re: serious problems with vacuuming databases

From: Tomas Vondra

> I guess you're right. I forgot to mention there are 12 composite indexes
> on the largest (and not deleted) table B, which has about 14,000,000 rows
> and 1 GB of data. I'll try to dump/reload the database ...

Aaargh, the problem is probably not caused by the largest table, as it
was dropped, filled in with the data, and only after that were all the
indexes created. The problem could be caused by the tables with deleted
data, of course.

t.v.

Re: serious problems with vacuuming databases

From: Alvaro Herrera

Tomas Vondra wrote:
> > Probably the indexes are bloated after the vacuum full.  I think the
> > best way to get rid of the "fat" is to recreate both tables and indexes
> > anew.  For this the best tool would be to CLUSTER the tables on some
> > index, probably the primary key.  This will be much faster than
> > VACUUMing the tables, and the indexes will be much smaller as a result.
>
> I guess you're right. I forgot to mention there are 12 composite indexes
> on the largest (and not deleted) table B, which has about 14,000,000 rows
> and 1 GB of data. I'll try to dump/reload the database ...

Huh, I didn't suggest dump/reload.  I suggested CLUSTER.  You need to
apply it only to tables where you have lots of dead tuples, which IIRC
are A, C and D.
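
If you want to check where the bloat actually is, comparing relpages
(the size in 8 kB blocks) with reltuples in pg_class gives a rough idea,
for example:

   SELECT relname, relkind, relpages, reltuples
     FROM pg_class
    WHERE relkind IN ('r', 'i')
    ORDER BY relpages DESC
    LIMIT 20;

(Both columns are only refreshed by VACUUM and ANALYZE, so treat them as
estimates.)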

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: serious problems with vacuuming databases

From: Tomas Vondra

> Huh, I didn't suggest dump/reload.  I suggested CLUSTER.  You need to
> apply it only to tables where you have lots of dead tuples, which IIRC
> are A, C and D.

Sorry, I should read more carefully. Will clustering a table according
to one index solve the problems with all the indexes on the table (if
the table has, for example, two indexes)?

t.v.

Re: serious problems with vacuuming databases

From: Alvaro Herrera

Tomas Vondra wrote:
> > Huh, I didn't suggest dump/reload.  I suggested CLUSTER.  You need to
> > apply it only to tables where you have lots of dead tuples, which IIRC
> > are A, C and D.
>
> Sorry, I should read more carefully. Will clustering a table according
> to one index solve the problems with all the indexes on the table (if
> the table has, for example, two indexes)?

Yes, it will rebuild all indexes.
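
You can see the effect by listing the indexes of the table with their
sizes before and after the CLUSTER, e.g. for table "a":

   SELECT i.relname AS index_name, i.relpages
     FROM pg_class t
     JOIN pg_index ix ON ix.indrelid = t.oid
     JOIN pg_class i ON i.oid = ix.indexrelid
    WHERE t.relname = 'a';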

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: serious problems with vacuuming databases

From: "Ahmad Fajar"

Hi Tomas,

Tomas wrote:
We've decided to remove unneeded 'old' data, which means removing about
99.999% of rows from tables A, C and D (about 2 GB of data). At the
beginning, the B table (containing data aggregated from A, C and D) was emptied
(dropped and created) and filled in with current data. Then, before the
deletion, the data from tables A, C and D were backed up using other tables
(say A_old, C_old, D_old), filled in using
.....
1) drop, create and fill table B (aggregated data from A, C, D)
2) copy 'old' data from A, C and D to A_old, C_old and D_old
3) delete old data from A, C, D
4) dump data from A_old, C_old and D_old
5) truncate tables A, C, D
6) vacuum full analyze tables A, C, D, A_old, C_old and D_old
----

I think you are doing some difficult database maintenance. Why do that, if
you just want to keep only a small piece of the data from your tables? Why
don't you try something like this:
1. create table A with no indexes (don't fill this table with data),
2. create a table A_week_year that inherits from table A, with the indexes
you want and some condition for insertion (e.g. table A1 is used for one
week of data of a year, and so on..),
3. do the same for tables B, C and D,
4. if you have relations, make them reference the inherited tables (optional).
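
For illustration, steps 1 and 2 might look roughly like this (the 'stamp'
column and the week boundaries are just examples):

   CREATE TABLE A (stamp timestamp, payload text);   -- parent table, kept empty

   CREATE TABLE A_week52_2005 (
       CHECK (stamp >= '2005-12-26' AND stamp < '2006-01-02')
   ) INHERITS (A);
   CREATE INDEX A_week52_2005_stamp_idx ON A_week52_2005 (stamp);

New rows then go directly into the child table for the current week, and
removing a whole week later is just a DROP TABLE instead of a big DELETE
plus VACUUM.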

I think you should read the postgresql documentation for more information
about table inheritance.

The impact is that you might have many tables, but each table will hold
only a small piece of the data, for example just one week. And you don't
have to do difficult database maintenance like you have done. You just
need to create tables for every week of data, do vacuum/analyze and
regular backups.


Best regards,
ahmad fajar,



Re: serious problems with vacuuming databases

From: Tomas Vondra

> Hi Tomas,
>
> Tomas wrote:
> We've decided to remove unneeded 'old' data, which means removing about
> 99.999% of rows from tables A, C and D (about 2 GB of data). At the
> beginning, the B table (containing data aggregated from A, C and D) was emptied
> (dropped and created) and filled in with current data. Then, before the
> deletion, the data from tables A, C and D were backed up using other tables
> (say A_old, C_old, D_old), filled in using
> .....
> 1) drop, create and fill table B (aggregated data from A, C, D)
> 2) copy 'old' data from A, C and D to A_old, C_old and D_old
> 3) delete old data from A, C, D
> 4) dump data from A_old, C_old and D_old
> 5) truncate tables A, C, D
> 6) vacuum full analyze tables A, C, D, A_old, C_old and D_old
> ----
>
> I think you are doing some difficult database maintenance. Why do that, if
> you just want to keep only a small piece of the data from your tables? Why
> don't you try something like this:
> 1. create table A with no indexes (don't fill this table with data),
> 2. create a table A_week_year that inherits from table A, with the indexes
> you want and some condition for insertion (e.g. table A1 is used for one
> week of data of a year, and so on..),
> 3. do the same for tables B, C and D,
> 4. if you have relations, make them reference the inherited tables (optional).
>
> I think you should read the postgresql documentation for more information
> about table inheritance.
>
> The impact is that you might have many tables, but each table will hold
> only a small piece of the data, for example just one week. And you don't
> have to do difficult database maintenance like you have done. You just
> need to create tables for every week of data, do vacuum/analyze and
> regular backups.
>
>
> Best regards,
> ahmad fajar,

Thanks for your advice, but I've read the sections about inheritance and
I don't see a way to use it in my case, as I think inheritance takes
care of the structure, not of the data.

But I've read the section about partitioning (using inheritance) too, and
it seems useful. I'll try to solve the performance issues using that.

Thanks for your advice
Tomas