Thread: strange table disk sizes

strange table disk sizes

From: Rik Bellens
Hello,

I have two large tables in a database, one containing original data and
the other derived from the first. The first table contains several
columns and indexes, while the second table has fewer columns and only
one index. Both tables have the same number of rows. Nevertheless, the
second table takes up much more disk space than the first one. How can
this be explained?

The tables are defined below. The first table 'measurement' contains
information about measurements taken by different devices at different
times. The second table 'stats_count' was created in order to be able to
quickly query the number of measurements for a particular device in an
arbitrary time interval. To accomplish this, the second table stores for
each measurement a cumulative count of the number of measurements up to
that time. The table is kept synchronised with the first table using
triggers.
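
(For the record, the kind of interval query I want cum_count to answer
quickly looks roughly like this, using the column names from the
definitions below; just a sketch written as a prepared statement, with a
made-up statement name and $1 = deviceid, $2 = type, $3 = interval start,
$4 = interval end:)

PREPARE count_interval (integer, measurementtype, timestamp, timestamp) AS
SELECT COALESCE((SELECT cum_count FROM stats_count
                  WHERE deviceid = $1 AND "type" = $2 AND "time" <= $4
                  ORDER BY "time" DESC LIMIT 1), 0)
     - COALESCE((SELECT cum_count FROM stats_count
                  WHERE deviceid = $1 AND "type" = $2 AND "time" < $3
                  ORDER BY "time" DESC LIMIT 1), 0) AS measurement_count;

The two subqueries look up the cumulative count at the end of the
interval and just before its start; the difference is the number of
measurements in the interval.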

The first table contains 4 integer columns, one enum column and one time
column, which would normally take 28 bytes per row. The table contains
about 50 million rows, so roughly 1.5 gigabytes of data. The actual disk
size of this table (without indexes) is 3 gigabytes, which is probably
explainable by per-row overhead. The disk size including indexes is
about 9 gigabytes.

The second table, on the other hand, contains only two integer columns,
one enum column and one time column. Nevertheless, its size on disk is
about 14 gigabytes without indexes. There is one index on this table,
which uses an extra 4 gigabytes. Can anyone explain this?
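
(For reference, the sizes above can be checked with the built-in size
functions, roughly like this; pg_relation_size gives the table alone,
while pg_total_relation_size includes indexes and TOAST:)

SELECT pg_size_pretty(pg_relation_size('stats_count'))       AS table_size,
       pg_size_pretty(pg_total_relation_size('stats_count')) AS total_size;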

One possible reason is that the primary key on table measurement is a
single integer, while the primary key on table stats_count is a
combination of three columns (type, deviceid, time). Logically it
requires more disk space to store the second key; however, the same
three columns also form a unique constraint on the first table, so that
index should take about the same amount of disk space there. Or does a
primary key take more disk space than another index or unique
constraint? Would it therefore be more efficient (i.e. use less disk
space) to add an extra id column to stats_count as primary key and keep
the combination (deviceid, type, time) as a regular index, as sketched
below? And if so, would queries still be as fast? (I can imagine that an
index that uses less disk space might not be as fast.)
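
(In other words, something like the following; this is just a sketch of
the alternative, and the index name is made up:)

CREATE TABLE stats_count
(
   id serial NOT NULL,   -- surrogate key instead of the composite key
   deviceid integer NOT NULL,
   "time" timestamp without time zone NOT NULL,
   "type" measurementtype NOT NULL,
   cum_count integer,
   CONSTRAINT stats_count_pkey PRIMARY KEY (id)
);

-- the old composite key becomes an ordinary index
CREATE INDEX stats_count_device_type_time_index
   ON stats_count
   USING btree
   (deviceid, "type", "time");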

Another possible reason could be the foreign key in table stats_count
that references the combination (deviceid, type, time) in table
measurement. Does this foreign key take up that much disk space?

Any help would be greatly appreciated,
Kind regards,
Rik


CREATE TABLE measurement
(
   id serial NOT NULL,
   deviceid integer NOT NULL,
   "time" timestamp without time zone NOT NULL,
   javaobjectid integer NOT NULL,
   javaobjectsubid integer NOT NULL,
   "type" measurementtype NOT NULL,
   CONSTRAINT measurement_pkey PRIMARY KEY (id),
   CONSTRAINT measurement_deviceid_fkey FOREIGN KEY (deviceid)
       REFERENCES device (id) MATCH SIMPLE
       ON UPDATE NO ACTION ON DELETE NO ACTION,
   CONSTRAINT measurement_javaobject_unique UNIQUE (javaobjectid, javaobjectsubid),
   CONSTRAINT measurement_unique UNIQUE (type, deviceid, "time")
);

CREATE INDEX measurement_javaobjectsubid_index
   ON measurement
   USING btree
   (javaobjectsubid);

CREATE TABLE stats_count
(
   deviceid integer NOT NULL,
   "time" timestamp without time zone NOT NULL,
   "type" measurementtype NOT NULL,
   cum_count integer,
   CONSTRAINT stats_count_pkey PRIMARY KEY (type, deviceid, "time"),
   CONSTRAINT stats_count_type_fkey FOREIGN KEY ("type", deviceid, "time")
       REFERENCES measurement ("type", deviceid, "time") MATCH SIMPLE
       ON UPDATE NO ACTION ON DELETE NO ACTION
);

Re: strange table disk sizes

From: Scott Marlowe
On Thu, Sep 1, 2011 at 4:32 AM, Rik Bellens <rik.bellens@telin.ugent.be> wrote:
>
> Hello,
>
> I have two large tables in a database, one containing original data and the
> other one derived from the first table. The first table contains several
> columns and indexes, while the second table has less columns and only one
> index. Both tables have the same number of rows. Nevertheless, the second
> table is much larger in disk size than the first one. How can this be
> explained?

This is most likely due to table bloat.  In PostgreSQL when you update
or delete a row, a dead version gets left behind.  Vacuum eventually
comes along and reclaims the empty space to be reused.  If you delete
/ update a LOT of rows at once, then you'll have a lot of dead rows
which can only be reused after vacuuming when you do more updates or
deletes later on.

A few salient questions. What version of PostgreSQL are you running?
Is autovacuum running? Do you do a LOT of bulk deletes / updates?  If
you do a lot of bulk deletes on this table, and you delete everything,
can you switch to using the truncate command instead?
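
Also, newer versions track an estimate of dead rows in the stats
collector; something like this gives a rough idea of how much has piled
up (n_dead_tup and last_autovacuum need 8.3 or later, and the numbers
are only estimates):

SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
  FROM pg_stat_user_tables
 WHERE relname IN ('measurement', 'stats_count');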

Re: strange table disk sizes

From: Rik Bellens
On 01-09-11 13:31, Scott Marlowe wrote:
> On Thu, Sep 1, 2011 at 4:32 AM, Rik Bellens<rik.bellens@telin.ugent.be>  wrote:
>> Hello,
>>
>> I have two large tables in a database, one containing original data and the
>> other one derived from the first table. The first table contains several
>> columns and indexes, while the second table has less columns and only one
>> index. Both tables have the same number of rows. Nevertheless, the second
>> table is much larger in disk size than the first one. How can this be
>> explained?
> This is most likely due to table bloat.  In PostgreSQL when you update
> or delete a row, a dead version gets left behind.  Vacuum eventually
> comes along and reclaims the empty space to be reused.  If you delete
> / update a LOT of rows at once, then you'll have a lot of dead rows
> which can only be reused after vacuuming when you do more updates or
> deletes later on.
>
> A few salient questions. What version of PostgreSQL are you running?
> Is autovacuum running? Do you do a LOT of bulk deletes / updates?  If
> you do a lot of bulk deletes on this table, and you delete everything,
> can you switch to using the truncate command instead?

I use version 8.3. I see the 'autovacuum launcher process' and
'autovacuum worker process' in the process list, so I suppose autovacuum
is running.

Rows in the measurement table are added once and never deleted or
updated. Adding a row to this table triggers a function that adds a row
to the stats_count table. Normally rows are added chronologically, so
rows in the stats_count table are normally not updated either. If,
however, for some reason a measurement with an older time is added, all
rows for that device that come after this time are updated, but I don't
think this happens very often.

The table 'stats_count' was created at a later stage, so the first 45M
rows were added at once and in chronological order. However, because the
function to initialize this table took a long time and the client
application crashed a few times, I had to restart this function several
times. Can it be that there is some garbage left over from running this
function several times without finishing it? Would running 'VACUUM FULL'
be a solution to reclaim some disk space?
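
(I suppose I could first run a plain vacuum with verbose output to see
how many dead row versions it finds, before resorting to VACUUM FULL;
something like:)

-- reports removed and remaining dead row versions per table and index
VACUUM VERBOSE stats_count;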




Re: strange table disk sizes

From: Rik Bellens
On 01-09-11 14:22, Scott Marlowe wrote:
> On Thu, Sep 1, 2011 at 6:05 AM, Rik Bellens<rik.bellens@telin.ugent.be>  wrote:
>> On 01-09-11 13:31, Scott Marlowe wrote:
>>> On Thu, Sep 1, 2011 at 4:32 AM, Rik Bellens<rik.bellens@telin.ugent.be>
>>>   wrote:
>>>> Hello,
>>>>
>>>> I have two large tables in a database, one containing original data and
>>>> the
>>>> other one derived from the first table. The first table contains several
>>>> columns and indexes, while the second table has less columns and only one
>>>> index. Both tables have the same number of rows. Nevertheless, the second
>>>> table is much larger in disk size than the first one. How can this be
>>>> explained?
>>> This is most likely due to table bloat.  In PostgreSQL when you update
>>> or delete a row, a dead version gets left behind.  Vacuum eventually
>>> comes along and reclaims the empty space to be reused.  If you delete
>>> / update a LOT of rows at once, then you'll have a lot of dead rows
>>> which can only be reused after vacuuming when you do more updates or
>>> deletes later on.
>>>
>>> A few salient questions. What version of PostgreSQL are you running?
>>> Is autovacuum running? Do you do a LOT of bulk deletes / updates?  If
>>> you do a lot of bulk deletes on this table, and you delete everything,
>>> can you switch to using the truncate command instead?
>> I use version 8.3. I see the 'autovacuum launcher process' and 'autovacuum
>> worker process' in the process list, so I suppose autovacuum is running.
>>
>> Rows in the measurement table are added once and never deleted or updated.
>> Adding a row to this table triggers a function that adds a row to the
>> stats_count table. Normally rows are added chronologically. So rows in the
>> stats_count table are normally not updated either. If however, for some
>> reason, a measurement is added from an older time, all rows of that device
>> which come after this time, are updated, but I don't think this will happen
>> very often.
>>
>> The table 'stats_count' was created in a later stage, so the first 45M rows
>> were added at once and chronologically. However, because the function to
>> initialize this table took a long time and the client application crashed a
>> few times, I had to restart this function several times. Can it be that
>> there is some trash left from running this function several times without
>> finishing it? Would it be a solution to run 'VACUUM FULL' to reclaim some
>> disk space?
> Yeah, could be.  Take a look at this page:
> http://wiki.postgresql.org/wiki/Show_database_bloat and see if the
> query there sheds some light on your situ.
Thanks for this answer.

If I run the query, I get 12433752064 wasted bytes on stats_count_pkey,
so I suppose that is the reason.


Re: strange table disk sizes

From: Scott Marlowe
On Thu, Sep 1, 2011 at 6:38 AM, Rik Bellens <rik.bellens@telin.ugent.be> wrote:
> On 01-09-11 14:22, Scott Marlowe wrote:
>>
>> On Thu, Sep 1, 2011 at 6:05 AM, Rik Bellens<rik.bellens@telin.ugent.be>
>>  wrote:
>>>
>>> On 01-09-11 13:31, Scott Marlowe wrote:
>>>>
>>>> On Thu, Sep 1, 2011 at 4:32 AM, Rik Bellens<rik.bellens@telin.ugent.be>
>>>>  wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I have two large tables in a database, one containing original data and
>>>>> the
>>>>> other one derived from the first table. The first table contains
>>>>> several
>>>>> columns and indexes, while the second table has less columns and only
>>>>> one
>>>>> index. Both tables have the same number of rows. Nevertheless, the
>>>>> second
>>>>> table is much larger in disk size than the first one. How can this be
>>>>> explained?
>>>>
>>>> This is most likely due to table bloat.  In PostgreSQL when you update
>>>> or delete a row, a dead version gets left behind.  Vacuum eventually
>>>> comes along and reclaims the empty space to be reused.  If you delete
>>>> / update a LOT of rows at once, then you'll have a lot of dead rows
>>>> which can only be reused after vacuuming when you do more updates or
>>>> deletes later on.
>>>>
>>>> A few salient questions. What version of PostgreSQL are you running?
>>>> Is autovacuum running? Do you do a LOT of bulk deletes / updates?  If
>>>> you do a lot of bulk deletes on this table, and you delete everything,
>>>> can you switch to using the truncate command instead?
>>>
>>> I use version 8.3. I see the 'autovacuum launcher process' and
>>> 'autovacuum
>>> worker process' in the process list, so I suppose autovacuum is running.
>>>
>>> Rows in the measurement table are added once and never deleted or
>>> updated.
>>> Adding a row to this table triggers a function that adds a row to the
>>> stats_count table. Normally rows are added chronologically. So rows in
>>> the
>>> stats_count table are normally not updated either. If however, for some
>>> reason, a measurement is added from an older time, all rows of that
>>> device
>>> which come after this time, are updated, but I don't think this will
>>> happen
>>> very often.
>>>
>>> The table 'stats_count' was created in a later stage, so the first 45M
>>> rows
>>> were added at once and chronologically. However, because the function to
>>> initialize this table took a long time and the client application crashed
>>> a
>>> few times, I had to restart this function several times. Can it be that
>>> there is some trash left from running this function several times without
>>> finishing it? Would it be a solution to run 'VACUUM FULL' to reclaim some
>>> disk space?
>>
>> Yeah, could be.  Take a look at this page:
>> http://wiki.postgresql.org/wiki/Show_database_bloat and see if the
>> query there sheds some light on your situ.
>
> thanks for this answer
>
> if i run the query, I get 12433752064 wasted bytes on stats_count_pkey, so I
> suppose that is the reason

Sounds like it.  Take a look here:
http://wiki.postgresql.org/wiki/Index_Maintenance
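
For the bloated index, a reindex rebuilds it from scratch; note that it
blocks writes (but not reads) on the table while it runs, so pick a
quiet moment. Roughly:

-- rebuild just the primary key index of stats_count
REINDEX INDEX stats_count_pkey;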

Re: strange table disk sizes

From: Scott Marlowe
On Thu, Sep 1, 2011 at 6:38 AM, Rik Bellens <rik.bellens@telin.ugent.be> wrote:
> On 01-09-11 14:22, Scott Marlowe wrote:

>> Yeah, could be.  Take a look at this page:
>> http://wiki.postgresql.org/wiki/Show_database_bloat and see if the
>> query there sheds some light on your situ.
>
> thanks for this answer
>
> if i run the query, I get 12433752064 wasted bytes on stats_count_pkey, so I
> suppose that is the reason

Also look into installing something like nagios and the
check_postgresql.pl plugin to keep track of these things before they
get out of hand.

csb time:  Back in the day when pg 6.5.3 and 7.0 were new and
interesting, I had a table that was 80k or so, and an index that was
about 100M. Back when dual core machines were servers, and 1G of RAM was
an extravagance. I had a process that deleted everything from the table
each night and replaced it, and the index was so huge that lookups were
taking something like 10 seconds each. A simple drop / create index
fixed it right up. The check_postgresql.pl script is a godsend for
keeping your db healthy and happy.

Re: strange table disk sizes

From: Albe Laurenz
Rik Bellens wrote:
> I have two large tables in a database, one containing original data and
> the other one derived from the first table. The first table contains
> several columns and indexes, while the second table has less columns and
> only one index. Both tables have the same number of rows. Nevertheless,
> the second table is much larger in disk size than the first one. How can
> this be explained?

Maybe the second table contains more empty space.

See
http://www.postgresql.org/docs/9.0/static/routine-vacuuming.html#VACUUM-FOR-SPACE-RECOVERY

Yours,
Laurenz Albe

Re: strange table disk sizes

From: Rik Bellens
On 01-09-11 14:43, Scott Marlowe wrote:
> On Thu, Sep 1, 2011 at 6:38 AM, Rik Bellens<rik.bellens@telin.ugent.be>  wrote:
>> On 01-09-11 14:22, Scott Marlowe wrote:
>>> Yeah, could be.  Take a look at this page:
>>> http://wiki.postgresql.org/wiki/Show_database_bloat and see if the
>>> query there sheds some light on your situ.
>> thanks for this answer
>>
>> if i run the query, I get 12433752064 wasted bytes on stats_count_pkey, so I
>> suppose that is the reason
> Also look into installing something like nagios and the
> check_postgresql.pl plugin to keep track of these things before they
> get out of hand.
>
> csb time:  Back in the day when pg 6.5.3 and 7.0 was new and
> interesting, I had a table that was 80k or so, and an index that was
> about 100M.  Back when dual core machines were servers, and 1G ram was
> an extravagance.   I had a process that deleted everything from the
> table each night and replaced it, and the index was so huge that
> lookups were taking something like 10 seconds each.  A simple drop /
> create index fixed it right up.  The check_postgresql.pl script is a
> god sent tool to keep your db healthy and happy.
After running REINDEX on stats_count_pkey, the disk size of this index
was reduced by about 2 gigs, but the size of the table itself was still
very large. Running 'VACUUM FULL' then also reduced the table size from
14 gigs to about 2 gigs.

I will now regularly check the database with the mentioned tools and
queries. Thank you for the very useful tips.


Re: strange table disk sizes

From
Scott Marlowe
Date:
On Thu, Sep 1, 2011 at 10:04 AM, Rik Bellens <rik.bellens@telin.ugent.be> wrote:
>
> after running reindex on the stats_count_pkey, the disk size of this index
> was reduced with about 2 gigs, but the size of the table itself was still
> very large.
> running 'vacuum full' also reduced the table size from 14 gigs to about 2
> gigs.
> I will now regularly check the database with the mentioned tools and
> queries.
> Thank you for the very useful tips.

Excellent.  Keep in mind that VACUUM FULL on a table can lead to index
bloat, so you might need to reindex afterwards.
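
For a one-off cleanup that usually boils down to something like this
(both commands take heavy locks on the table, so run them when it can
be unavailable for a while):

-- compact the table, then rebuild all of its indexes
VACUUM FULL stats_count;
REINDEX TABLE stats_count;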