Thread: database 1.2G, pg_dump 73M?!

database 1.2G, pg_dump 73M?!

From: Ross Boylan
I have a postgres server for which du reports
1188072    /var/lib/postgresql/8.2/main
on a Linux system.
The server has only one real database, which is for bacula.  When I dump
the database, it's 73 MB.

This is immediately after I did a full vacuum and restarted the server.

Also,
bacula=> SELECT relname, relpages FROM pg_class ORDER BY relpages DESC
limit 15;
             relname             | relpages
---------------------------------+----------
 file_jpfid_idx                  |    27122
 file_pathid_idx                 |    17969
 file_jobid_idx                  |    17948
 file_pkey                       |    14580
 file_fp_idx                     |    12714
 file                            |    11558
 file_filenameid_idx             |     9806
 filename                        |     3958
 filename_name_idx               |     2510
 filename_pkey                   |     1367
 path                            |      966
 path_name_idx                   |      950
 path_pkey                       |      151
 pg_attribute_relid_attnam_index |       46
 pg_proc                         |       45

It seems very strange to me that there is such a difference in size
between the dump and the database: the data store is almost 15 times
larger than the dump.

Is this to be expected (e.g., from the indices taking up disk space)?
Is there anything I can do to reclaim some disk space?

Thanks.
Ross


Re: database 1.2G, pg_dump 73M?!

From: Joris Dobbelsteen
Ross Boylan wrote:
> I have a postgres server for which du reports
> 1188072    /var/lib/postgresql/8.2/main
> on  Linux system.
> The server has only one real database, which is for bacula.  When I dump
> the database, it's 73Mg.
>
> This is immediately after I did a full vacuum and restarted the server.
>
> Also,
> bacula=> SELECT relname, relpages FROM pg_class ORDER BY relpages DESC
> limit 15;
>              relname             | relpages
> ---------------------------------+----------
>  file_jpfid_idx                  |    27122
>  file_pathid_idx                 |    17969
>  file_jobid_idx                  |    17948
>  file_pkey                       |    14580
>  file_fp_idx                     |    12714
>  file                            |    11558
>  file_filenameid_idx             |     9806
>  filename                        |     3958
>  filename_name_idx               |     2510
>  filename_pkey                   |     1367
>  path                            |      966
>  path_name_idx                   |      950
>  path_pkey                       |      151
>  pg_attribute_relid_attnam_index |       46
>  pg_proc                         |       45
>
> It seems very strange to me that there is such a difference in size
> between the dump and the database: the data store is almost 15 time
> larger than the dump.
>
> Is this to be expected (e.g., from the indices taking up disk space)?
> Is there anything I can do to reclaim some disk space
There are a few factors you need to take into account:

    * Data storage in the database is packed into blocks and contains
      header data. Since data needs to be put into blocks there is a
      potential for wasting space; in the worst case a block may end up
      holding only a single row.
    * You need to vacuum often, to ensure obsolete rows are removed and
      their space can be reused.
    * Tables are not reduced in size, only grown. I think CLUSTER and
      VACUUM FULL will reduce the size of your tables.
    * Indexes are not in the backup; they are derived from the table
      data on a restore.
      If you remove the indexes you are left with 150~200 MB of data (I
      guessed).
      Doing a reindex will rebuild each index and get rid of all the bloat
      it has collected during use.  _I recommend you try this_, as your
      indexes on the file table look quite huge; see the sketch just below.
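      A minimal sketch of what that looks like in psql, using the table
      names from the listing above (REINDEX TABLE rebuilds every index on
      the named table, and blocks writes to that table while it runs):

      bacula=> REINDEX TABLE file;
      bacula=> REINDEX TABLE filename;
      bacula=> REINDEX TABLE path;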

But the most important factor for you will be the following:

    * Backups are compressed. Since you store filenames and paths, these
      have a great deal of regularity and are therefore very good targets
      for compression. This can save a huge amount of data; with a
      compression ratio of 50%~70% you will reach your ~70 MB.

Oh, and server restarts will not help reduce your database size. In fact,
nothing at all should change, except lower performance until enough
cached data is back in the cache again.

Hope this helps...

- Joris

Re: database 1.2G, pg_dump 73M?!

From: Ross Boylan
On Sun, 2008-03-30 at 20:27 +0200, Joris Dobbelsteen wrote:
> Ross Boylan wrote:
> > I have a postgres server for which du reports
> > 1188072    /var/lib/postgresql/8.2/main
> > on  Linux system.
> > The server has only one real database, which is for bacula.  When I dump
> > the database, it's 73Mg.
> >
> > This is immediately after I did a full vacuum and restarted the server.
> >
> > Also,
> > bacula=> SELECT relname, relpages FROM pg_class ORDER BY relpages DESC
> > limit 15;
> >              relname             | relpages
> > ---------------------------------+----------
> >  file_jpfid_idx                  |    27122
> >  file_pathid_idx                 |    17969
> >  file_jobid_idx                  |    17948
> >  file_pkey                       |    14580
> >  file_fp_idx                     |    12714
> >  file                            |    11558
> >  file_filenameid_idx             |     9806
> >  filename                        |     3958
> >  filename_name_idx               |     2510
> >  filename_pkey                   |     1367
> >  path                            |      966
> >  path_name_idx                   |      950
> >  path_pkey                       |      151
> >  pg_attribute_relid_attnam_index |       46
> >  pg_proc                         |       45
> >
> > It seems very strange to me that there is such a difference in size
> > between the dump and the database: the data store is almost 15 time
> > larger than the dump.
> >
> > Is this to be expected (e.g., from the indices taking up disk space)?
> > Is there anything I can do to reclaim some disk space
> There are a few factors you need to take into account:
>
>     * Data storage in the database is packed into blocks and contains
>       header data. Since data needs to be put into blocks there is a
>       potential for waisting space. If you are unlucky it can become
>       nearly a single row in the worst case.
>     * You need to vacuum often, to ensure obsolete rows are removed and
>       space can be reused.
>     * Tables are not reduced in size and only grown. I thinks cluster
>       and vacuum full will reduce the size of your table.
>     * Indexes are not in the backup, they are derived from the table
>       data on a restore.
>       If you remove the indexes you are left with 150~200 MB of data (I
>       guessed).
>       Doing reindex will rebuild the index and get rid of all the bloat
>       it has been collected during use.  _I recommend you try this_, as
>       your indexes on the file table look quite huge.
Reindexing had a huge effect.  After reindexing the top tables (file,
filename and path) I now see
 SELECT relname, relpages FROM pg_class ORDER BY relpages DESC limit 15;
             relname             | relpages
---------------------------------+----------
 file                            |    11558
 filename                        |     3958
 filename_name_idx               |     2383
 file_jpfid_idx                  |     2145
 file_fp_idx                     |     1787
 file_jobid_idx                  |     1427
 file_pathid_idx                 |     1427
 file_pkey                       |     1427
 file_filenameid_idx             |     1427
 filename_pkey                   |     1367
 path                            |      966
 path_name_idx                   |      871
 path_pkey                       |      151
 pg_attribute_relid_attnam_index |       46
 pg_proc                         |       45
and du now reports 451M.  That still seems a bit large, given the size
of the SQL dump, but it's almost 2/3 lower than it was before.

Thanks so much!

I guess I need to figure out how to reindex automatically.
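
(One sketch of how that could be automated: generate the REINDEX statements
from the catalog and feed the result back through psql from a cron job. The
'public' schema is an assumption about where bacula keeps its tables.)

SELECT 'REINDEX TABLE ' || c.relname || ';'
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r' AND n.nspname = 'public'
ORDER BY c.relname;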
>
> But the most important factor for you will be the following:
>
>     * Backups are compressed. Since you store filenames and paths these
>       will have a very high amount of regularity and therefore are very
>       good targets for compression. This can save a huge amount of data.
>       If you take a compression factor of 50%~70% you will reach your 70 MB.
I don't see how this is relevant, since my dump file was plain text
(sql).


> Ow, server restarts will not help reduce your database size. In fact,
> nothing at all should change, except lower performance until sufficient
> cached data is back in the cache again.
>
> Hope this helps...

It was a huge help.
>
> - Joris

Re: database 1.2G, pg_dump 73M?!

From: Joris Dobbelsteen
Ross Boylan wrote:
> On Sun, 2008-03-30 at 20:27 +0200, Joris Dobbelsteen wrote:
>
>> Ross Boylan wrote:
>>
>>> I have a postgres server for which du reports
>>> 1188072    /var/lib/postgresql/8.2/main
>>> on  Linux system.
>>> The server has only one real database, which is for bacula.  When I dump
>>> the database, it's 73Mg.
>>>
>>> This is immediately after I did a full vacuum and restarted the server.
>>>
>>> Also,
>>> bacula=> SELECT relname, relpages FROM pg_class ORDER BY relpages DESC
>>> limit 15;
>>>              relname             | relpages
>>> ---------------------------------+----------
>>>  file_jpfid_idx                  |    27122
>>>  file_pathid_idx                 |    17969
>>>  file_jobid_idx                  |    17948
>>>  file_pkey                       |    14580
>>>  file_fp_idx                     |    12714
>>>  file                            |    11558
>>>  file_filenameid_idx             |     9806
>>>  filename                        |     3958
>>>  filename_name_idx               |     2510
>>>  filename_pkey                   |     1367
>>>  path                            |      966
>>>  path_name_idx                   |      950
>>>  path_pkey                       |      151
>>>  pg_attribute_relid_attnam_index |       46
>>>  pg_proc                         |       45
>>>
>>> It seems very strange to me that there is such a difference in size
>>> between the dump and the database: the data store is almost 15 time
>>> larger than the dump.
>>>
>>> Is this to be expected (e.g., from the indices taking up disk space)?
>>> Is there anything I can do to reclaim some disk space
>>>
>> There are a few factors you need to take into account:
>>
>>     * Data storage in the database is packed into blocks and contains
>>       header data. Since data needs to be put into blocks there is a
>>       potential for waisting space. If you are unlucky it can become
>>       nearly a single row in the worst case.
>>     * You need to vacuum often, to ensure obsolete rows are removed and
>>       space can be reused.
>>     * Tables are not reduced in size and only grown. I thinks cluster
>>       and vacuum full will reduce the size of your table.
>>     * Indexes are not in the backup, they are derived from the table
>>       data on a restore.
>>       If you remove the indexes you are left with 150~200 MB of data (I
>>       guessed).
>>       Doing reindex will rebuild the index and get rid of all the bloat
>>       it has been collected during use.  _I recommend you try this_, as
>>       your indexes on the file table look quite huge.
>>
> reindexing had a huge effect.  After reindex the top tables (file,
> filename and path) I now see
>  SELECT relname, relpages FROM pg_class ORDER BY relpages DESC limit 15;
>              relname             | relpages
> ---------------------------------+----------
>  file                            |    11558
>  filename                        |     3958
>  filename_name_idx               |     2383
>  file_jpfid_idx                  |     2145
>  file_fp_idx                     |     1787
>  file_jobid_idx                  |     1427
>  file_pathid_idx                 |     1427
>  file_pkey                       |     1427
>  file_filenameid_idx             |     1427
>  filename_pkey                   |     1367
>  path                            |      966
>  path_name_idx                   |      871
>  path_pkey                       |      151
>  pg_attribute_relid_attnam_index |       46
>  pg_proc                         |       45
> and du now reports 451M.  That still seems a bit large, given the size
> of the sql dump, but it's almost 2/3 lower than it was before.
>
> Thanks so much!
>
> I guess I need to figure out how to reindex automatically.
>
Take a threshold, e.g. look at which indexes are approaching (or
exceeding) their table's size, or something similar; see the sketch below.
The bloat is mostly caused by continuous updates to the indexes on every
insert, update and delete command. The index needs to split pages that
might only be merged back together some time later. Doing frequent
vacuums might, or might not, prevent this. Even in theory the algorithms
allow trees to grow quite large, up to a certain constant factor, in order
to keep a good bound on the amount of work that must be done per operation
on the index.
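
(A rough sketch of such a check: it lists every index next to its table,
both measured in 8 KB pages, so indexes that approach or exceed their table
stand out:)

SELECT t.relname AS tablename, i.relname AS indexname,
       t.relpages AS table_pages, i.relpages AS index_pages
FROM pg_class t
JOIN pg_index x ON x.indrelid = t.oid
JOIN pg_class i ON i.oid = x.indexrelid
WHERE t.relkind = 'r'
ORDER BY i.relpages DESC
LIMIT 15;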
>> But the most important factor for you will be the following:
>>
>>     * Backups are compressed. Since you store filenames and paths these
>>       will have a very high amount of regularity and therefore are very
>>       good targets for compression. This can save a huge amount of data.
>>       If you take a compression factor of 50%~70% you will reach your 70 MB.
>>
> I don't see how this is relevant, since my dump file was plain text
> (sql).
>
From the top contenders, about half are indexes, so you are stuck with
~200 MB of data in the tables.
PostgreSQL has some wasted space due to the placement of tuples in a
block and overhead for each block and row. I don't know the exact values,
but the per-tuple overhead is in the range of 24 bytes, I believe. Secondly,
a block is 8 KB by default and a tuple cannot be stored across multiple
blocks (that's what TOAST is there to work around).

All in all: look up your tuple sizes. If they are small, then the overhead
from PostgreSQL can be a big factor; if they are huge, you lose space to
unoccupied portions of blocks. I believe pg_statistics will provide this
information.
Another factor is that the representation in the SQL dump might be more
efficient than in the database, but this depends highly on your data set.
For example, an int8 takes 8 bytes in a table, while it takes between 1 and
~20 bytes in an SQL dump.

Why the plain SQL dump comes out this small I cannot explain without much
more detail.
>> Hope this helps...
>>
>
> It was a huge help.
>
Glad it was,

- Joris

Re: database 1.2G, pg_dump 73M?!

From: Tom Lane
Ross Boylan <RossBoylan@stanfordalumni.org> writes:
> reindexing had a huge effect.

So the indexes were indeed bloated.  There are some known usage patterns
in which regular vacuum isn't very good at reclaiming space in b-tree
indexes.  For example if you make daily entries in an index by date and
later remove all but the last-of-the-month entry --- this leaves a few
entries on every index page and we don't have code to collapse that,
short of reindexing.

However what seems more likely is that you're getting burnt by excessive
use of VACUUM FULL.  V.F., far from shrinking indexes, tends to bloat
them.  Recommended practice is to use plain VACUUM often enough that you
don't need VACUUM FULL.

            regards, tom lane

Re: database 1.2G, pg_dump 73M?!

From: Ross Boylan
On Sun, 2008-03-30 at 21:22 +0200, Joris Dobbelsteen wrote:
>  From the top contenders, about half are indexes, so you are stuck
> with
> ~200 MB of data in the tables.
> Postgresql has some wasted space due to placement of the tuples in a
> block and overhead for each block and row. I don't know those values,
> but they are in the range of 24 bytes per tuple, I believe. Secondly
> a
> block is 8 KB by default and tuples cannot be stored into multiple
> blocks (thats what toast to work around).
>
> All in all: Lookup tuple sizes, if they are small than the overhead
> from
> postgresql can be a big factor. If you are huge you loose on portions
> of
> unoccupied space in blocks. I believe pg_statistics will provide this
> information.
There is a pg_statistic (no "s") table, but I don't know how to get
tuple size from it--the documentation refers to the source code to
figure out the codes.  Backing up a step, I don't know what a tuple is
in Postgres, and don't see an entry for it in the documentation's index.

Is a tuple just a row?  That's what the docs say for the following
report:
# select distinct relname, reltuples, relpages from pg_class where
relkind='r' and substring(relname, 1, 3) != 'pg_';

         relname         | reltuples | relpages
-------------------------+-----------+----------
 basefiles               |         0 |        0
 cdimages                |         0 |        0
 client                  |         2 |        1
 counters                |         1 |        1
 device                  |         0 |        0
 file                    |    650659 |    11558
 filename                |    623012 |     3958
 fileset                 |        22 |        1
 job                     |       384 |       10
 jobmedia                |       596 |        7
 location                |         0 |        0
 locationlog             |         0 |        0
 log                     |         0 |        0
 media                   |       245 |        9
 mediatype               |         2 |        1
 path                    |     67908 |      966
 pool                    |         5 |        1
 sql_features            |       439 |        6
 sql_implementation_info |        12 |        1
 sql_languages           |         4 |        1
 sql_packages            |        10 |        1
 sql_parts               |         9 |        1
 sql_sizing              |        23 |        1
 sql_sizing_profiles     |         0 |        0
 status                  |        19 |        1
 storage                 |         2 |        1
 unsavedfiles            |         0 |        0
 version                 |         1 |        1

> Another factor is representation in the SQL dump might be more
> efficient
> than in the database, but this highly depends on your data set. For
> example, a int8 takes 8 bytes in a table, while it takes between 1
> and
> ~20 in a SQL dump.
>
> How the plain SQL dump becomes this small I cannot explain without
> much
> much more details.

On Tom's point, bacula regularly inserts entries into the tables and
then, days to months later, deletes them.  As far as I know, the VACUUM
FULLs I just did were the first ever; I did do several of them because I
kept getting messages about needing more fsm_pages.

I am still trying to figure out if the database was getting any
automatic vacuuming at all.  The Postgres documentation (the database is
8.2, though I'm moving to 8.3 soon) sounds as if it's on automatically,
but the Debian-specific documentation suggests I may need to do some
additional things to enable it.

Probably the low fsm_pages also hurt disk usage, since the message
accompanying the vacuum said that's what happens if fsm_pages is low.
It was 20k; vacuum said I needed 56k, and I upped it to 80k.
I'm not sure if my recent cleaning has brought the needed fsm_pages
down.
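
(For reference, both can be checked from psql: SHOW works for any setting,
and a database-wide VACUUM VERBOSE reports, near the end of its output, how
many free-space-map page slots it currently needs.)

bacula=> SHOW max_fsm_pages;
bacula=> VACUUM VERBOSE;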

I've only been doing partial backups for the last few months, so there's
probably less info in the tables than under normal conditions.  I
suppose it's possible the space I gained was just a temporary win.

Ross

Re: database 1.2G, pg_dump 73M?!

From
"Joris Dobbelsteen"
Date:
>-----Original Message-----
>From: Ross Boylan [mailto:RossBoylan@stanfordalumni.org]
>Sent: Sunday, 30 March 2008 23:43
>To: Joris Dobbelsteen
>Cc: RossBoylan@stanfordalumni.org; pgsql-general@postgresql.org
>Subject: Re: [GENERAL] database 1.2G, pg_dump 73M?!
>
>On Sun, 2008-03-30 at 21:22 +0200, Joris Dobbelsteen wrote:
>>  From the top contenders, about half are indexes, so you are stuck
>> with ~200 MB of data in the tables.
>> Postgresql has some wasted space due to placement of the tuples in a
>> block and overhead for each block and row. I don't know
>those values,
>> but they are in the range of 24 bytes per tuple, I believe.
>Secondly a
>> block is 8 KB by default and tuples cannot be stored into multiple
>> blocks (thats what toast to work around).
>>
>> All in all: Lookup tuple sizes, if they are small than the overhead
>> from postgresql can be a big factor. If you are huge you loose on
>> portions of unoccupied space in blocks. I believe pg_statistics will
>> provide this information.
>There is a pg_statistic (no "s")  table, but I don't know how
>to get tuple size from it--the documentation refers to the
>source code to figure out the codes.  Backing up a step, I
>don't know what a tuple is in Postgres, and don't see an entry
>for it in the index.

It was pg_stats.
You get avg_width, which it gives per column.

So probably you want
SELECT tablename, SUM(avg_width)
FROM pg_stats
WHERE schemaname = 'public'
GROUP BY tablename;

>Is a tuple just a row?  That's what the docs say for the following
>report:

Yes.
Also, where I typed "block", read "page".

># select distinct relname, reltuples, relpages from pg_class
>where relkind='r' and substring(relname, 1, 3) != 'pg_';
>
>         relname         | reltuples | relpages
>-------------------------+-----------+----------
> basefiles               |         0 |        0
> cdimages                |         0 |        0
> client                  |         2 |        1
> counters                |         1 |        1
> device                  |         0 |        0
> file                    |    650659 |    11558
> filename                |    623012 |     3958
> fileset                 |        22 |        1
> job                     |       384 |       10
> jobmedia                |       596 |        7
> location                |         0 |        0
> locationlog             |         0 |        0
> log                     |         0 |        0
> media                   |       245 |        9
> mediatype               |         2 |        1
> path                    |     67908 |      966
> pool                    |         5 |        1
> sql_features            |       439 |        6
> sql_implementation_info |        12 |        1
> sql_languages           |         4 |        1
> sql_packages            |        10 |        1
> sql_parts               |         9 |        1
> sql_sizing              |        23 |        1
> sql_sizing_profiles     |         0 |        0
> status                  |        19 |        1
> storage                 |         2 |        1
> unsavedfiles            |         0 |        0
> version                 |         1 |        1

Do relpages * 8192 / reltuples.
Default installs use 8 KB pages/blocks.

For file that gives ~145 bytes/tuple.
For filename it gives ~52 bytes/tuple.

Probably you will get some significant differences here.
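
(The same arithmetic as a single query -- a sketch that skips empty tables:)

SELECT relname,
       round(relpages * 8192.0 / reltuples::numeric) AS approx_bytes_per_tuple
FROM pg_class
WHERE relkind = 'r' AND reltuples > 0
ORDER BY relpages DESC;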

>> Another factor is representation in the SQL dump might be more
>> efficient than in the database, but this highly depends on your data
>> set. For example, a int8 takes 8 bytes in a table, while it takes
>> between 1 and ~20 in a SQL dump.
>>
>> How the plain SQL dump becomes this small I cannot explain without
>> much much more details.
>
>On Tom's point, bacula regularly inserts entries into the
>tables and then, days to months later, deletes them.  As far
>as I know, the VACUUM FULLs I just did were the first ever; I
>did do several of them because I kept getting messages about
>needing more fsm_pages.
>
>I am still trying to figure out if the database was getting
>any automatic vacuuming at all.  The Postgres documentation
>(the database is 8.2, though I'm moving to 8.3 soon) sounds as
>if it's on automatically, but the Debian-specific
>documentation suggests I may need to do some additional things
>to enable it.
>
>Probably the fsm_pages being low also hurt disk useage, since
>the message accompanying the vacuum said that's what happens
>if fsm_pages is low.  It was 20k; vacuum said I needed 56k,
>and I upped it to 80k.
>I'm not sure if my recent cleaning has brought the needed
>fsm_pages down.
>
>I've only been doing partial backups for the last few months,
>so there's probably less info in the tables than under normal
>conditions.  I suppose it's possible the space I gained was
>just a temporary win.

Looks like some configuration changes are needed to tune your
installation to better suit your needs. fsm_pages can be modified, but I
don't know if any other parameters might need changing for that. The
documentation can help, and probably a lot of people here could do and
explain it blindfolded.

- Joris

Re: database 1.2G, pg_dump 73M?!

From: Ross Boylan
On Sun, 2008-03-30 at 22:59 +0100, Joris Dobbelsteen wrote:
> >-----Original Message-----
> >From: Ross Boylan [mailto:RossBoylan@stanfordalumni.org]
> >Sent: Sunday, 30 March 2008 23:43
> >To: Joris Dobbelsteen
> >Cc: RossBoylan@stanfordalumni.org; pgsql-general@postgresql.org
> >Subject: Re: [GENERAL] database 1.2G, pg_dump 73M?!
> >
> >On Sun, 2008-03-30 at 21:22 +0200, Joris Dobbelsteen wrote:
> >>  From the top contenders, about half are indexes, so you are stuck
> >> with ~200 MB of data in the tables.
> >> Postgresql has some wasted space due to placement of the tuples in a
> >> block and overhead for each block and row. I don't know
> >those values,
> >> but they are in the range of 24 bytes per tuple, I believe.
> >Secondly a
> >> block is 8 KB by default and tuples cannot be stored into multiple
> >> blocks (thats what toast to work around).
> >>
> >> All in all: Lookup tuple sizes, if they are small than the overhead
> >> from postgresql can be a big factor. If you are huge you loose on
> >> portions of unoccupied space in blocks. I believe pg_statistics will
> >> provide this information.
> >There is a pg_statistic (no "s")  table, but I don't know how
> >to get tuple size from it--the documentation refers to the
> >source code to figure out the codes.  Backing up a step, I
> >don't know what a tuple is in Postgres, and don't see an entry
> >for it in the index.
>
> It was pg_stats.
> You get avg_width. It gives this per column.
>
> So probably you want
> SELECT tablename, SUM(avg_width)
> FROM pg_stats
> WHERE schemaname = 'public'
[Thanks; I didn't know about using schemaname to limit it to interesting
tables]
> GROUP BY tablename [RB added] ORDER BY tablename;
 tablename | sum
-----------+-----
 client    |  62
 counters  |  25
 file      | 109
 filename  |  18
 fileset   |  53
 job       | 149
 jobmedia  |  52
 media     | 226
 mediatype |  16
 path      |  82
 pool      | 179
 status    |  29
 storage   |  23
 version   |   4

So, for example, if each tuple has 24 bytes of overhead, the overhead
more than doubles the size of the filename table (18 bytes per row), which
has a big record count.  Am I following correctly?

Between the space taken up by indices and the other overhead, the size
difference between the SQL dump and the database's disk usage is starting
to seem more reasonable.

The database uses SQL-ASCII encoding, so I'd expect the textual parts
(filenames and paths) to take up the same space (just for the basic
storage, not counting overhead/indices) in Postgres as in the dump: 1
byte/character.

I'm not sure what "If you are huge you loose on portions of unoccupied
space in blocks" means.
>
> >Is a tuple just a row?  That's what the docs say for the following
> >report:
>
> Yes.
> Also where I typed "block" it will refer to "page".
>
> ># select distinct relname, reltuples, relpages from pg_class
> >where relkind='r' and substring(relname, 1, 3) != 'pg_';
> >
> >         relname         | reltuples | relpages
> >-------------------------+-----------+----------
> > basefiles               |         0 |        0
> > cdimages                |         0 |        0
> > client                  |         2 |        1
> > counters                |         1 |        1
> > device                  |         0 |        0
> > file                    |    650659 |    11558
> > filename                |    623012 |     3958
> > fileset                 |        22 |        1
> > job                     |       384 |       10
> > jobmedia                |       596 |        7
> > location                |         0 |        0
> > locationlog             |         0 |        0
> > log                     |         0 |        0
> > media                   |       245 |        9
> > mediatype               |         2 |        1
> > path                    |     67908 |      966
> > pool                    |         5 |        1
> > sql_features            |       439 |        6
> > sql_implementation_info |        12 |        1
> > sql_languages           |         4 |        1
> > sql_packages            |        10 |        1
> > sql_parts               |         9 |        1
> > sql_sizing              |        23 |        1
> > sql_sizing_profiles     |         0 |        0
> > status                  |        19 |        1
> > storage                 |         2 |        1
> > unsavedfiles            |         0 |        0
> > version                 |         1 |        1
>
> Do relpages * 8096 / reltuples.
> Default installs have 8 KB pages/blocks by default.
>
> For file it should be ~144 bytes/tuple.
> For filename it should be ~51 bytes/tuple.

Those tuple size estimates seem consistent with the tuple sizes reported
earlier when the overhead is added in.  I'm impressed!
>
> Probably you will get some signficant differences here.
>
> >> Another factor is representation in the SQL dump might be more
> >> efficient than in the database, but this highly depends on your data
> >> set. For example, a int8 takes 8 bytes in a table, while it takes
> >> between 1 and ~20 in a SQL dump.
> >>
> >> How the plain SQL dump becomes this small I cannot explain without
> >> much much more details.
> >
> >On Tom's point, bacula regularly inserts entries into the
> >tables and then, days to months later, deletes them.  As far
> >as I know, the VACUUM FULLs I just did were the first ever; I
> >did do several of them because I kept getting messages about
> >needing more fsm_pages.
> >
> >I am still trying to figure out if the database was getting
> >any automatic vacuuming at all.  The Postgres documentation
> >(the database is 8.2, though I'm moving to 8.3 soon) sounds as
> >if it's on automatically, but the Debian-specific
> >documentation suggests I may need to do some additional things
> >to enable it.
> >
> >Probably the fsm_pages being low also hurt disk useage, since
> >the message accompanying the vacuum said that's what happens
> >if fsm_pages is low.  It was 20k; vacuum said I needed 56k,
> >and I upped it to 80k.
> >I'm not sure if my recent cleaning has brought the needed
> >fsm_pages down.
> >
> >I've only been doing partial backups for the last few months,
> >so there's probably less info in the tables than under normal
> >conditions.  I suppose it's possible the space I gained was
> >just a temporary win.
>
> Looks like some configuration changes are needed to tune your
> installation to better suite you needs. fsm_pages can be modified,
done
>  but I
> don't know if any other parameters might need change for that.
I didn't seem to bump into any system limits; I had already upped the
overall shared memory limit a bit.
> Documentation can help and probably a lot of people here can do and tell
> it blindfolded.

I'm not a DB admin; I only play one on my computer.  I clearly need to
figure out how to get regular vacuum, analyze, and reindexing going (if
they aren't going already).
>
Thanks for all your help.
Ross

Re: database 1.2G, pg_dump 73M?!

From: Joris Dobbelsteen
>-----Original Message-----
>From: Ross Boylan [mailto:RossBoylan@stanfordalumni.org]
>Sent: Monday, 31 March 2008 0:23
>To: Joris Dobbelsteen
>Cc: RossBoylan@stanfordalumni.org; pgsql-general@postgresql.org
>Subject: RE: [GENERAL] database 1.2G, pg_dump 73M?!
>
>
>On Sun, 2008-03-30 at 22:59 +0100, Joris Dobbelsteen wrote:
>> >-----Original Message-----
>> >From: Ross Boylan [mailto:RossBoylan@stanfordalumni.org]
>> >Sent: Sunday, 30 March 2008 23:43
>> >To: Joris Dobbelsteen
>> >Cc: RossBoylan@stanfordalumni.org; pgsql-general@postgresql.org
>> >Subject: Re: [GENERAL] database 1.2G, pg_dump 73M?!
>> >
>> >On Sun, 2008-03-30 at 21:22 +0200, Joris Dobbelsteen wrote:
>> >>  From the top contenders, about half are indexes, so you
>are stuck
>> >> with ~200 MB of data in the tables.
>> >> Postgresql has some wasted space due to placement of the
>tuples in
>> >> a block and overhead for each block and row. I don't know
>> >those values,
>> >> but they are in the range of 24 bytes per tuple, I believe.
>> >Secondly a
>> >> block is 8 KB by default and tuples cannot be stored into
>multiple
>> >> blocks (thats what toast to work around).
>> >>
>> >> All in all: Lookup tuple sizes, if they are small than
>the overhead
>> >> from postgresql can be a big factor. If you are huge you loose on
>> >> portions of unoccupied space in blocks. I believe pg_statistics
>> >> will provide this information.
>> >There is a pg_statistic (no "s")  table, but I don't know
>how to get
>> >tuple size from it--the documentation refers to the source code to
>> >figure out the codes.  Backing up a step, I don't know what a tuple
>> >is in Postgres, and don't see an entry for it in the index.
>>
>> It was pg_stats.
>> You get avg_width. It gives this per column.
>>
>> So probably you want
>> SELECT tablename, SUM(avg_width)
>> FROM pg_stats
>> WHERE schemaname = 'public'
>[Thanks; I didn't know about using schemaname to limit it to
>interesting tables]
>> GROUP BY tablename [RB added] ORDER BY tablename;
> tablename | sum
>-----------+-----
> client    |  62
> counters  |  25
> file      | 109
> filename  |  18
> fileset   |  53
> job       | 149
> jobmedia  |  52
> media     | 226
> mediatype |  16
> path      |  82
> pool      | 179
> status    |  29
> storage   |  23
> version   |   4
>
>So, for example, if each tuple has 24 bytes of overhead, the
>overhead more than doubles the size of the file table (18
>bytes), which has a big record count.  Am I following correctly?

Yes.
(Note: don't pin it down to exactly 24 bytes; it varies between versions. I
think it is close enough, however.)

>Between the space taken up by indices and the other overhead,
>the size difference between the sql dump and the db disk
>useage is starting to seem more reasonable.
>
>The database uses SQL-ASCII encoding, so I'd expect the
>textual parts (filenames and paths) to take up the same space
>(just for the basic storage, not counting overhead/indices) in
>Postgres as on the dump, 1 byte/character.

That's the idea.

>I'm not sure what "If you are huge you loose on portions of
>unoccupied space in blocks" means.

A tuple has to fit in exactly 1 page (or block); it cannot span multiple
pages. Hence if your tuples happen to be half a page in size, or slightly
larger, only a single one will fit in each page, so you waste nearly half
the page. You don't have to worry about this, as your tuples are
significantly smaller than that.

(For reference only: if a tuple is larger than a block, some attributes
will be evicted to the TOAST table or compressed. That is how the roughly
8 KB per-tuple limitation is worked around.)
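
(To make that concrete with numbers like the ones in this thread -- a
~42-byte filename row versus a hypothetical row of just over half a page:)

SELECT 8192 / 42 AS small_rows_per_page, 8192 / 4200 AS big_rows_per_page;

Roughly 190+ small rows fit per 8 KB page (ignoring page and item headers),
while only one of the big rows fits, leaving almost half of every page
unused.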

[snip]
>>
>> ># select distinct relname, reltuples, relpages from pg_class where
>> >relkind='r' and substring(relname, 1, 3) != 'pg_';
>> >
>> >         relname         | reltuples | relpages
>> >-------------------------+-----------+----------
>> > basefiles               |         0 |        0
>> > cdimages                |         0 |        0
>> > client                  |         2 |        1
>> > counters                |         1 |        1
>> > device                  |         0 |        0
>> > file                    |    650659 |    11558
>> > filename                |    623012 |     3958
>> > fileset                 |        22 |        1
>> > job                     |       384 |       10
>> > jobmedia                |       596 |        7
>> > location                |         0 |        0
>> > locationlog             |         0 |        0
>> > log                     |         0 |        0
>> > media                   |       245 |        9
>> > mediatype               |         2 |        1
>> > path                    |     67908 |      966
>> > pool                    |         5 |        1
>> > sql_features            |       439 |        6
>> > sql_implementation_info |        12 |        1
>> > sql_languages           |         4 |        1
>> > sql_packages            |        10 |        1
>> > sql_parts               |         9 |        1
>> > sql_sizing              |        23 |        1
>> > sql_sizing_profiles     |         0 |        0
>> > status                  |        19 |        1
>> > storage                 |         2 |        1
>> > unsavedfiles            |         0 |        0
>> > version                 |         1 |        1
>>
>> Do relpages * 8096 / reltuples.
>> Default installs have 8 KB pages/blocks by default.
>>
>> For file it should be ~144 bytes/tuple.
>> For filename it should be ~51 bytes/tuple.
>
>Those tuple size estimates seem consistent with the tuple
>sizes reported earlier when the overhead is added in.  I'm impressed!
>
[snip]

- Joris

Re: database 1.2G, pg_dump 73M?!

From: Dragan Zubac
Ross Boylan wrote:
> I'm not a DB admin; I only play one on my computer.  I clearly need to
> figure out how to get regular vacuum, analyze, and reindexing going (if
> they aren't going already).
> Thanks for all your help.
> Ross

1. Optimize your SQL queries and 'understand' index usage; don't index everything because 'somebody told You indexes are good to speed things up'. For example:

db=> explain analyze select email from users where email='abc@abc.com';
                                            QUERY PLAN                                           
--------------------------------------------------------------------------------------------------
 Seq Scan on users  (cost=0.00..652.40 rows=1 width=42) (actual time=3.467..3.467 rows=0 loops=1)
   Filter: (email = 'abc@abc.com'::text)
 Total runtime: 3.497 ms

Now we add an index on table 'users', column 'email':

db=> create index users_email_idx on users (email);

repeat the query:

db=> explain analyze select email from users where email='abc@abc.com';
                                                   QUERY PLAN                                                   
-----------------------------------------------------------------------------------------------------------------
 Index Scan using users_email_idx on users  (cost=0.00..8.05 rows=1 width=42) (actual time=0.113..0.113 rows=0 loops=1)
   Index Cond: (email = 'abc@abc.com'::text)
 Total runtime: 0.152 ms
(3 rows)

Compare just the 'Total runtime' for a start: 0.152 ms for the second query versus 3.497 ms for the first -- I guess we have some improvement?

Always use 'explain analyze' to see whether any change You make to Your database actually makes a difference.

2. Do 'vacuum verbose analyze <table>' occasionally, depending on Your database usage: if You're deleting/updating frequently, You will have to 'vacuum' more often. If You import all Your data once and then basically only do 'select' (reading data), do 'vacuum' on New Year's Eve :)

3. You want to know the size of Your tables and indexes? You want to know their size before and after a 'vacuum'? Example:

db=> select * from pg_size_pretty(pg_relation_size('users'));
 pg_size_pretty
----------------
 5496 kB
(1 row)

Then we do an ordinary 'vacuum':

db=> vacuum users;
VACUUM
db=> select * from pg_size_pretty(pg_relation_size('users'));
 pg_size_pretty
----------------
 5520 kB
(1 row)

Not much improvement; OK, so we do a full 'vacuum':

db=> vacuum full users;
VACUUM
db=> select * from pg_size_pretty(pg_relation_size('users'));
 pg_size_pretty
----------------
 80 kB
(1 row)

How about indexes? Example:

db=> select * from pg_size_pretty(pg_relation_size('users_pkey'));
 pg_size_pretty
----------------
 192 kB
(1 row)

For indexes there is no 'vacuum'; instead there is the 'reindex' command, like:

db=> REINDEX INDEX users_pkey ;
REINDEX

We look at the index size after the 'reindex' command:

db=> select * from pg_size_pretty(pg_relation_size('users_pkey'));
 pg_size_pretty
----------------
 16 kB
(1 row)
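
If You want the table plus all of its indexes (and toast) in one number,
there is also pg_total_relation_size -- available since 8.1, if I remember
correctly:

db=> select * from pg_size_pretty(pg_total_relation_size('users'));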


4. Don't bother to 'analyze' anything :) If all those numbers and stuff don't mean anything to You, Your life will probably be easier than if You try to comprehend all those 'analyzes' :) Just do the regular 'clean-up' on Your DB and You should be fine.

Unfortunately, PostgreSQL is not a database meant for people looking for a 'black box' solution, meaning plug-in-the-electricity-and-forget. But if You belong to the group of people who are willing to understand better what their database solution is up to, and to maximize the ratio 'what do I get'/'for how much money', You're welcome by all means :)

Sincerely

Dragan




Re: database 1.2G, pg_dump 73M?!

From: Anton Melser
>  I am still trying to figure out if the database was getting any
>  automatic vacuuming at all.  The Postgres documentation (the database is
>  8.2, though I'm moving to 8.3 soon) sounds as if it's on automatically,
>  but the Debian-specific documentation suggests I may need to do some
>  additional things to enable it.

This is probably far too late, but anyway: you do indeed have to enable
autovacuum with 8.2, as it isn't enabled by default, at least with
most distributions. In 8.3 it's enabled by default with most distributions.
# show autovacuum;
will tell you if it's on.
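
As far as I remember, on 8.2 the row-level statistics collector also has to
be on, or autovacuum silently does nothing, so it's worth checking all three
(they are set in postgresql.conf):
# show autovacuum;
# show stats_start_collector;
# show stats_row_level;
All three should report "on".
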
Cheers
Anton