Thread: Large DB, several tuning questions: Index sizes, VACUUM, REINDEX, Autovacuum

Large DB, several tuning questions: Index sizes, VACUUM, REINDEX, Autovacuum

From
Andreas Thiel
Date:
Hi All,


Maybe some of these are quite newbie questions; I did try hard to scan
all the articles and documentation, but I did not find a satisfying
answer.

I'm running PostgreSQL 8.3.6 on a 32-bit CentOS 4 machine (which I
should probably update to 64-bit soon).


I have some tables which tend to get huge (and will for sure hit the
wall of my storage system soon; total DB size is ~700 GB now):

SELECT relfilenode, relpages,reltuples,relname FROM pg_class WHERE
relpages > 10000 ORDER BY relpages DESC;
 relfilenode | relpages |  reltuples  |             relname
-------------+----------+-------------+----------------------------------
       72693 | 51308246 | 4.46436e+09 | result_orig
       72711 | 17871658 | 6.15227e+06 | test
       73113 | 12240806 | 4.46436e+09 | result_orig_test_id
       73112 | 12240806 | 4.46436e+09 | result_orig_prt_id
       72717 |   118408 | 6.15241e+06 | test_orig
       72775 |    26489 | 6.15241e+06 | test_orig_lt_id
       72755 |    19865 | 6.15241e+06 | test_orig_test_id_key
       73147 |    16872 | 6.15227e+06 | test_test_id
       73146 |    16872 | 6.15227e+06 | test_lt_id
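
For reference, the corresponding on-disk sizes can be pulled directly
(pg_relation_size/pg_size_pretty should work on 8.3):

SELECT relname, relpages, pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relpages > 10000
ORDER BY relpages DESC;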


I'm going to work on the size of the largest table (result_orig)
itself by eliminating columns, packing n booleans into bit(n) columns,
replacing double precision with real, etc. This should let me reduce
the storage per row to ~1/3 of the bytes currently used.
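
Roughly this kind of change, as a sketch (the column names here are
made up for illustration, not my real schema):

CREATE TABLE result_slim (
    test_id bigint REFERENCES test_orig(test_id) ON DELETE CASCADE,
    prt_id  bigint REFERENCES part_orig(prt_id) ON DELETE CASCADE,
    flags   bit(16),  -- n booleans packed into one bit(n) column
    value   real      -- was double precision
);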

I have the same information stored in an Oracle 10g DB which consumes
only 70 GB for data and 2 GB for indexes. The schema may be better
optimized, but it certainly contains a table with 4 billion rows as
well. So it's about 10x smaller on disk than PostgreSQL. I wonder why.

But still:

### My Issue No. 1: Index Size
What really worries me is the size of the two largest indexes
(result_orig_test_id, result_orig_prt_id) I'm using. Each is roughly
1/3 of the result_orig table size, and each b-tree indexes only a
single bigint column (prt_id, test_id) of result_orig. Roughly every
group of 100 rows of result_orig shares the same prt_id, and roughly
every group of 1000-10000 rows shares the same test_id. Each of these
two columns is a foreign key (ON DELETE CASCADE).

So my fear is now: even if I can reduce the amount of data per row in
result_orig, my indexes will remain as large as before and then
dominate disk usage.

Is such disk usage for indexes expected? What can I do to optimize? I
could not yet run a VACUUM on result_orig, as I hit the max_fsm_pages
limit (still trying to adjust that one). I tried REINDEX, but it didn't
change anything.


### My Issue No. 2: relpages and VACUUM
I have another table "test" which is - as a starting point - created by
INSERTs and then UPDATE'd. It has the same columns and roughly the same
number of rows as table test_orig, but consumes 160 times the number of
pages. I tried VACUUM on this table but it did not change its relpages
count at all. Maybe this is just because VACUUM without FULL does not
reclaim disk space, i.e. relpages stays as it is? I did observe that
after VACUUM, a REINDEX on this table considerably shrank the size of
its indexes (test_test_id, test_lt_id).


### My Issue No. 3: VACUUM FULL out of memory
I tried to do a VACUUM FULL on the two tables (test, result_orig)
mentioned above. In both cases it fails with an out-of-memory error on
a very small request, like this:

ERROR:  out of memory
DETAIL:  Failed on request of size 224.

I use these kernel settings:
kernel.shmmni = 4096
kernel.shmall = 2097152
kernel.shmmax = 2147483648
vm.overcommit_memory = 2

And these postgresql.conf settings:
shared_buffers = 512MB                  # min 128kB or max_connections*16kB
temp_buffers = 128MB                    # min 800kB
max_prepared_transactions = 1024        # can be 0 or more
work_mem = 16MB                         # min 64kB
maintenance_work_mem = 256MB            # min 1MB
max_stack_depth = 8MB                   # min 100kB
max_fsm_pages = 70000000                # min max_fsm_relations*16, 6 bytes each
max_fsm_relations = 4194304             # min 100, ~70 bytes each
#max_files_per_process = 1000           # min 25
#shared_preload_libraries = ''          # (change requires restart)

What's going wrong here? I know one should not use VACUUM FULL, but I
was curious to see whether it would have any impact on the relpages
count mentioned in Issue 2.


### My Issue No. 4: Autovacuum
I have the feeling that autovacuum is not really running - else why are
tables and indexes growing that much, especially the "test" table?

#------------------------------------------------------------------------------
# AUTOVACUUM PARAMETERS
#------------------------------------------------------------------------------

autovacuum = on                         # Enable autovacuum subprocess?  'on'
log_autovacuum_min_duration = 1000      # -1 disables, 0 logs all actions and
                                        # their durations
autovacuum_max_workers = 3              # max number of autovacuum subprocesses
autovacuum_naptime = 1min               # time between autovacuum runs
autovacuum_vacuum_threshold = 50        # min number of row updates before vacuum
autovacuum_analyze_threshold = 50       # min number of row updates before analyze
autovacuum_vacuum_scale_factor = 0.2    # fraction of table size before vacuum
autovacuum_analyze_scale_factor = 0.1   # fraction of table size before analyze
autovacuum_freeze_max_age = 200000000   # maximum XID age before forced vacuum
autovacuum_vacuum_cost_delay = 20       # default vacuum cost delay for
                                        # autovacuum, in milliseconds
autovacuum_vacuum_cost_limit = -1       # default vacuum cost limit for
                                        # autovacuum, -1 means use
                                        # vacuum_cost_limit

How would I check that it is running correctly? I don't see any error
messages from autovacuum in syslog.



Any help, also on tuning postgresql.conf for this application, is
greatly appreciated!

Thanks

Andy




Re: Large DB, several tuning questions: Index sizes, VACUUM, REINDEX, Autovacuum

From
Andres Freund
Date:
Hi,

On Saturday 05 December 2009 00:03:12 Andreas Thiel wrote:
> I'm running PostgreSQL 8.3.6 on a 32-bit CentOS 4 machine (which I
> should probably update to 64-bit soon).
How much memory?


> I'm going to work on the size of the largest table (result_orig)
> itself by eliminating columns, packing n booleans into bit(n) columns,
> replacing double precision with real, etc. This should let me reduce
> the storage per row to ~1/3 of the bytes currently used.
That sounds rather ambitious - did you factor in the per-row overhead?

> I have the same information stored in an Oracle 10g DB which consumes
> only 70 GB for data and 2 GB for indexes. The schema may be better
> optimized, but it certainly contains a table with 4 billion rows as
> well. So it's about 10x smaller on disk than PostgreSQL. I wonder why.
That's hard to say without seeing the table definitions for both. Could
you post them?

2GB for indexes sounds rather small - are those b-trees?

It might also be interesting to look into the free space map to see how
much empty space there is - there is a contrib module, pg_freespacemap,
for that.
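
On 8.3 that module is installed by running its SQL script (CREATE
EXTENSION does not exist yet); the exact path depends on your
installation:

-- run as superuser in the target database
\i /usr/share/pgsql/contrib/pg_freespacemap.sql
SELECT * FROM pg_freespacemap_relations;
SELECT count(*) FROM pg_freespacemap_pages;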

You can also check how many dead tuples an 'ANALYZE VERBOSE tablename' sees.

> Is such disk usage for indexes expected? What can I do to optimize? I
> could not yet run a VACUUM on result_orig, as I hit the max_fsm_pages
> limit (still trying to adjust that one). I tried REINDEX, but it didn't
> change anything.
So it's quite possible that your relations are heavily bloated -
although if you reindexed, that shouldn't matter that much.
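
For a quick sanity check on whether those two big indexes are bloated
at all - my back-of-envelope, assuming ~20 bytes per leaf entry for a
single-bigint b-tree (8-byte key + 8-byte index tuple header + 4-byte
line pointer), ignoring inner pages and fillfactor:

SELECT (4.46436e9 * 20 / 8192)::bigint AS expected_leaf_pages;
-- ~10.9M pages, vs. the 12.2M pages you observe - so those indexes are
-- roughly the size one should expect for 4.46 billion entries.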

Btw, have you possibly left over some old prepared transactions or an
idle-in-transaction connection? Both can lead to severe bloat.
For the former you can check the system table pg_prepared_xact, for the
latter pg_stat_activity.

> ### My Issue No. 2: relpages and VACUUM
> I have another table "test" which is - as a starting point - created by
> INSERTs and then UPDATE'd. It has the same columns and roughly the same
> number of rows as table test_orig, but consumes 160 times the number of
> pages. I tried VACUUM on this table but it did not change its relpages
> count at all. Maybe this is just because VACUUM without FULL does not
> reclaim disk space, i.e. relpages stays as it is? I did observe that
> after VACUUM, a REINDEX on this table considerably shrank the size of
> its indexes (test_test_id, test_lt_id).
A normal VACUUM does not move tuples around - it only marks space as
free so it can later be reused.

(If the free space is at the end of the table, VACUUM tries to give it
back to the OS, provided there are no locks preventing that.)

> ### My Issue No. 3: VACUUM FULL out of memory
> I tried to do a VACUUM FULL on the two tables (test, result_orig)
> mentioned above. In both cases it fails with an out-of-memory error on
> a very small request, like this:
>
> ERROR:  out of memory
> DETAIL:  Failed on request of size 224.
Well, that's the amount of memory it's trying to allocate, not the
amount it has already allocated. Normally the postmaster should output
some sort of memory map when that happens. Did you get anything like
that?

> I use these kernel settings:
> kernel.shmmni = 4096
> kernel.shmall = 2097152
> kernel.shmmax = 2147483648
> vm.overcommit_memory = 2

> max_stack_depth = 8MB                   # min 100kB
That sounds a bit too high if you consider that libc and friends may use
some stack space as well - although that should be unrelated to the
current issue.

> max_fsm_pages = 70000000                # min max_fsm_relations*16, 6 bytes each
As a very rough guide you can start with the sum of relpages in pg_class for
that one.
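
I.e. something like:

SELECT sum(relpages) FROM pg_class;
-- compare the result with your max_fsm_pages setting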

> max_fsm_relations = 4194304             # min 100, ~70 bytes each
That seems kinda high. Do you have multiple millions of relations? It
might be related to the OOM situation during VACUUM FULL, although it
seems rather unlikely.

> ### My Issue No. 4: Autovacuum
> I have the feeling that autovacuum is not really running - else why are
> tables and indexes growing that much, especially the "test" table?
You should see notes about autovacuum in the logs. With an
autovacuum_vacuum_scale_factor of 0.2 you need 0.2 times the size of a
table in changed tuples before autovacuum starts. For a billion rows
that's quite a bit. I found that this setting often is too high.
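
The trigger condition is roughly: dead tuples >
autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor *
reltuples. With your settings that works out to:

SELECT relname, reltuples,
       (50 + 0.2 * reltuples)::bigint AS dead_rows_before_autovacuum
FROM pg_class
WHERE relname = 'result_orig';
-- roughly 893 million dead rows for 4.46e9 reltuples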

> How would I check that it is running correctly? I don't see any error
> messages from autovacuum in syslog.
You should see messages about it starting in the syslog.
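The statistics views also record the last vacuum times on 8.3, which is
probably the easiest check:

SELECT relname, last_vacuum, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY relname;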


Andres

Re: Large DB, several tuning questions: Index sizes, VACUUM, REINDEX, Autovacuum

From
Craig Ringer
Date:
On 5/12/2009 7:03 AM, Andreas Thiel wrote:
> Hi All,
>
>
> Maybe some of these are quite newbie questions; I did try hard to scan
> all the articles and documentation, but I did not find a satisfying
> answer.

> ### My Issue No. 1: Index Size
> Is such disk usage for indexes expected? What can I do to optimize? I
> could not yet run a VACUUM on result_orig, as I hit the max_fsm_pages
> limit

You'll like 8.4 then, as you no longer have to play with max_fsm_pages.

The fact that you're hitting max_fsm_pages suggests that you are
probably going to be encountering table bloat.

Of course, to get to 8.4 you're going to have to go through a dump and
reload of doom...

> ### My Issue No. 2: relpages and VACUUM
> I have another table "test" which is - as a starting point - created by
> INSERTs and then UPDATE'd. It has the same columns and roughly the same
> number of rows as table test_orig, but consumes 160 times the number of
> pages. I tried VACUUM on this table but it did not change its relpages
> count at all. Maybe this is just because VACUUM without FULL does not
> reclaim disk space, i.e. relpages stays as it is? I did observe that
> after VACUUM, a REINDEX on this table considerably shrank the size of
> its indexes (test_test_id, test_lt_id).

CLUSTER is often convenient for rewriting a highly bloated table.
You'll need enough free disk space to hold the live rows of the table
twice, plus the dead space once, while CLUSTER runs.
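
Something along these lines, assuming test_test_id is a reasonable
physical ordering to rewrite the table in (8.3 accepts the USING
syntax):

CLUSTER test USING test_test_id;  -- takes an exclusive lock while it runs
ANALYZE test;                     -- refresh planner statistics afterwards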

--
Craig Ringer

Re: Large DB, several tuning questions: Index sizes, VACUUM, REINDEX, Autovacuum

From
Greg Smith
Date:
Craig Ringer wrote:
>> ### My Issue No. 1: Index Size
>> Is such disk usage for indexes expected? What can I do to optimize? I
>> could not yet run a VACUUM on result_orig, as I hit the max_fsm_pages
>> limit
>
> You'll like 8.4 then, as you no longer have to play with max_fsm_pages.
> The fact that you're hitting max_fsm_pages suggests that you are
> probably going to be encountering table bloat.
> Of course, to get to 8.4 you're going to have to go through a dump and
> reload of doom...
Yeah, increasing max_fsm_pages and seeing what VACUUM VERBOSE tells you
afterwards is job #1, as all of the information you're getting now is
useless if VACUUM is stalled out on a giant task.  It should be possible
to migrate from 8.3 to 8.4 using pg_migrator rather than doing a dump
and reload.  I would recommend considering that as soon as
possible - your options are either to learn a lot about better VACUUM
practice and be diligent never to exceed the FSM limits again, or to
switch to 8.4, where it takes care of itself.
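
That is, run something like:

VACUUM VERBOSE result_orig;

and pay attention to the free space map summary printed at the end - if
it reports that more page slots are needed than max_fsm_pages allows,
the setting is still too low.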

You also need to be careful not to let the system run completely out of
disk space before doing something about this, because CLUSTER (the only
useful way to clean up after a VACUUM mistake of the magnitude you're
facing now) requires making a second copy of the live data in the table
as its method of cleaning things up. That option goes away once you're
really low on disk space, and if you get backed into that corner you'll
really be stuck.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


Re: Large DB, several tuning questions: Index sizes, VACUUM, REINDEX, Autovacuum

From
Andres Freund
Date:
Hi Andreas,

Could you please quote the email properly? The way you did it is quite
unreadable, because one always has to guess who wrote what.

On Sunday 06 December 2009 17:06:39 Andreas Thiel wrote:
> > I'm going to work on the size of the largest table (result_orig)
> > itself by eliminating columns, packing n booleans into bit(n) columns,
> > replacing double precision with real, etc. This should let me reduce
> > the storage per row to ~1/3 of the bytes currently used.
> That sounds rather ambitious - did you factor in the per-row overhead?
> I did now create the new table; I have 63 instead of 94 bytes/row on
> average. So yes, you're right, I'm about to hit the bottom of the
> per-row overhead.
How did you calculate that? Did you factor in the alignment
requirements? The DDL would be helpful...

> Btw, have you possibly left over some old prepared transactions or an
> idle-in-transaction connection? Both can lead to severe bloat.
> For the former you can check the system table pg_prepared_xact, for the
> latter pg_stat_activity.
> Seems not to be the case; pg_prepared_xact doesn't even exist.
It's pg_prepared_xacts (note the s), sorry, my mind played a trick on me.
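
To spell out both checks:

SELECT * FROM pg_prepared_xacts;
-- 8.3 flags idle-in-transaction sessions in current_query:
SELECT procpid, usename, xact_start, current_query
FROM pg_stat_activity
WHERE current_query = '<IDLE> in transaction';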

> Where would I find that postmaster output? In syslog? There's nothing
> visible...
Depends on your setup. I don't have the slightest clue about CentOS. If
necessary, start the postmaster directly.

> > max_fsm_relations = 4194304             # min 100, ~70 bytes each
Have you corrected that value?


Andres

Re: Large DB, several tuning questions: Index sizes, VACUUM, REINDEX, Autovacuum

From
Andres Freund
Date:
On Sunday 06 December 2009 19:20:17 Andreas Thiel wrote:
> Hi Andres,
>
> Thanks a lot for your answers. As bottom line I think the answer is I
> have to rethink my DB structure.
Can't answer that one without knowing much more ;)

> > Could you please quote the email properly? The way you did it is quite
> > unreadable, because one always has to guess who wrote what.
> I'm trying to - is it getting better now? My apologies, still trying to
> adapt to using Office 07 :-)
Better, yes.


> Well, I know the data types of my columns sum up to 32 bytes right now
> (was about 100 before). As I only see a reduction of relpages/reltuples
> by 30% and not by a factor of 3, I assume that the row overhead kicks
> in. The data definition of the new table looks like this:
> bigint REFERENCES test_orig(test_id) ON DELETE CASCADE
> bigint REFERENCES part_orig(prt_id) ON DELETE CASCADE
> smallint
> bit(16)
> real
> text (usually empty in most rows)
> smallint
> I calculated 32 bytes per row (if text is empty), but actually
> relpages/reltuples gives about ~63 bytes. This would mean a per-row
> overhead of 31 bytes. Would it change anything if I removed the two
> FOREIGN KEY constraints?
If you remove those columns entirely, sure. If you remove only the
constraints, no.

The row overhead in 8.3/8.4 is 28 bytes afaik. You missed two points in
your calculation - one is alignment (i.e. an integer will only start at
a 4-byte boundary), and the other is that for text you need to store the
length of the column as well.
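
Instead of summing by hand you can also let the server measure it -
pg_column_size() on a whole row returns the actual datum size including
header and alignment padding. Sampling here, since scanning all 4.46e9
rows would take forever (using result_orig - substitute your new
table's name; the composite adds a small header of its own, so treat
the number as approximate):

SELECT avg(pg_column_size(r.*)) AS avg_row_bytes
FROM (SELECT * FROM result_orig LIMIT 100000) r;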

> > It's pg_prepared_xacts (note the s), sorry, my mind played a trick on me.
> Nothing inside this table either. (While trying to improve
> postgresql.conf a few days ago, I also restarted the server a couple of
> times; I think that would have removed any hanging transactions or
> prepares, shouldn't it?)
No, prepared transactions do not get removed by restarting. But that's
fine then.

> > max_fsm_relations = 4194304             # min 100, ~70 bytes each
max_fsm_relations is the max number of relations you want to store in
the FSM - currently that means you could have 4 million tables+indexes.
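
You can compare that against what you actually have:

SELECT count(*) FROM pg_class WHERE relkind IN ('r', 'i', 't');
-- tables, indexes and toast tables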

> No, but it seems at least VACUUM is now running fine and no longer
> complaining about a too-small max_fsm_pages. Do you think that if I
> reduce those two numbers, I'll have a better chance of running VACUUM
> FULL? Currently max_fsm_pages is slightly larger than relpages of my
> largest table. I read somewhere that max_fsm_pages should be about 1/2
> of the total number of relpages in a DB; maybe another way to say it is
> that it should be larger than the largest table...
The largest table does not really have any special influence on the
FSM, so I wouldn't count that rule as very good.
It's not that easy to calculate the size of the FSM correctly - that's
why it's gone in 8.4...

I know of several instances running with larger fsm_pages values - you
could try to reduce the fsm_relations setting - I don't know if there
are problems lurking with such an oversized value.

I actually doubt that that's related to the OOM you're seeing, though -
what is your maintenance_work_mem setting, what is your
/proc/sys/vm/overcommit_ratio, and how much swap do you have?

Andres

Re: Large DB, several tuning questions: Index sizes, VACUUM, REINDEX, Autovacuum

From
Scott Marlowe
Date:
On Sun, Dec 6, 2009 at 12:09 PM, Andres Freund <andres@anarazel.de> wrote:
> I know of several instances running with larger fsm_pages values - you
> could try to reduce the fsm_relations setting - I don't know if there
> are problems lurking with such an oversized value.

I run a db with 10M max_fsm_pages and 500k max_fsm_relations.  We use
about 4.5M pages and only 1200 or so relations.  But we HAVE many more
relations than that, in the 40k range, so the higher number for max
relations is to make sure that if all of those start getting updated we
can track them too.