Thread: strange table disk sizes
Hello,

I have two large tables in a database, one containing original data and the other derived from the first. The first table contains several columns and indexes, while the second has fewer columns and only one index. Both tables have the same number of rows. Nevertheless, the second table is much larger on disk than the first. How can this be explained? The tables are defined below.

The first table, 'measurement', contains information about measurements taken by different devices at different times. The second table, 'stats_count', was created in order to be able to quickly query the number of measurements for a particular device in an arbitrary time interval. To accomplish this, the second table stores, for each measurement, a cumulative count of the number of measurements up to that time. The table is kept synchronised with the first table by triggers.

The first table contains 4 integer columns, one enum column and one timestamp column, which would normally take 28 bytes per row. The table contains about 50 million rows, which means about 1.5 gigabytes of data. The actual disk size of this table (without indexes) is 3 gigabytes, which is probably explainable by some overhead. The disk size with indexes is about 9 gigabytes.

The second table, on the other hand, contains only 3 integer columns and one timestamp column. Nevertheless, its size on disk is about 14 gigabytes without indexes. The one index on this table uses an extra 4 gigabytes. Can anyone explain this?

One possible reason is that the primary key on table measurement is a single integer, while the primary key on table stats_count is a combination of two integers and a timestamp. Logically it requires more disk space to store the second key; however, the same two integers and timestamp also form an index on the first table, so they should take the same amount of disk space there. Or does a primary key take more disk space than another index or unique constraint?
Would it therefore be more efficient (= less disk space) to use an extra id column in the table stats_count as primary key and the combination (deviceid, type, time) as a regular index? And if so, would queries still be as fast? (I can imagine that if an index uses less disk space, it might not be as fast.) Another possible reason could be the foreign key in table stats_count that references the combination (deviceid, type, time) in table measurement. Does this foreign key take that much disk space?

Any help would be greatly appreciated,

Kind regards,
Rik

CREATE TABLE measurement
(
  id serial NOT NULL,
  deviceid integer NOT NULL,
  "time" timestamp without time zone NOT NULL,
  javaobjectid integer NOT NULL,
  javaobjectsubid integer NOT NULL,
  "type" measurementtype NOT NULL,
  CONSTRAINT measurement_pkey PRIMARY KEY (id),
  CONSTRAINT measurement_deviceid_fkey FOREIGN KEY (deviceid)
      REFERENCES device (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT measurement_javaobject_unique UNIQUE (javaobjectid, javaobjectsubid),
  CONSTRAINT measurement_unique UNIQUE (type, deviceid, "time")
);

CREATE INDEX measurement_javaobjectsubid_index
  ON measurement USING btree (javaobjectsubid);

CREATE TABLE stats_count
(
  deviceid integer NOT NULL,
  "time" timestamp without time zone NOT NULL,
  "type" measurementtype NOT NULL,
  cum_count integer,
  CONSTRAINT stats_count_pkey PRIMARY KEY (type, deviceid, "time"),
  CONSTRAINT stats_count_type_fkey FOREIGN KEY ("type", deviceid, "time")
      REFERENCES measurement ("type", deviceid, "time") MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION
);
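For reference, the cumulative-count scheme described above could be queried roughly like this (a sketch only; the device id, type value and interval bounds are made-up placeholders, and it assumes cum_count increases monotonically per (type, deviceid)):

```sql
-- Number of measurements for one device/type in the interval (t1, t2]:
-- cumulative count at the last row with time <= t2, minus the
-- cumulative count at the last row with time <= t1. Each subquery is a
-- single descent of the (type, deviceid, "time") primary key index.
SELECT coalesce(
         (SELECT cum_count FROM stats_count
           WHERE "type" = 'sometype' AND deviceid = 42
             AND "time" <= TIMESTAMP '2011-09-01 00:00'
           ORDER BY "time" DESC LIMIT 1), 0)
     - coalesce(
         (SELECT cum_count FROM stats_count
           WHERE "type" = 'sometype' AND deviceid = 42
             AND "time" <= TIMESTAMP '2011-08-01 00:00'
           ORDER BY "time" DESC LIMIT 1), 0) AS n_measurements;
```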
On Thu, Sep 1, 2011 at 4:32 AM, Rik Bellens <rik.bellens@telin.ugent.be> wrote:
> I have two large tables in a database, one containing original data and the
> other one derived from the first table. [...] Both tables have the same
> number of rows. Nevertheless, the second table is much larger in disk size
> than the first one. How can this be explained?

This is most likely due to table bloat. In PostgreSQL, when you update or delete a row, a dead version gets left behind. Vacuum eventually comes along and reclaims the empty space to be reused. If you delete / update a LOT of rows at once, then you'll have a lot of dead rows which can only be reused after vacuuming, when you do more updates or deletes later on.

A few salient questions: What version of PostgreSQL are you running? Is autovacuum running? Do you do a LOT of bulk deletes / updates? If you do a lot of bulk deletes on this table, and you delete everything, can you switch to using the TRUNCATE command instead?
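A quick way to check for dead-row bloat directly is the statistics views and size functions (a sketch; the table names are from this thread, and these columns and functions are all available in 8.3):

```sql
-- Live vs. dead row counts and last (auto)vacuum times, per table:
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
  FROM pg_stat_user_tables
 WHERE relname IN ('measurement', 'stats_count');

-- On-disk size of the heap alone vs. heap plus indexes:
SELECT pg_size_pretty(pg_relation_size('stats_count'))       AS table_only,
       pg_size_pretty(pg_total_relation_size('stats_count')) AS with_indexes;
```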
Op 01-09-11 13:31, Scott Marlowe schreef:
> This is most likely due to table bloat. [...]
>
> A few salient questions. What version of PostgreSQL are you running?
> Is autovacuum running? Do you do a LOT of bulk deletes / updates?

I use version 8.3. I see the 'autovacuum launcher process' and 'autovacuum worker process' in the process list, so I suppose autovacuum is running.

Rows in the measurement table are added once and never deleted or updated. Adding a row to this table triggers a function that adds a row to the stats_count table. Normally rows are added chronologically, so rows in the stats_count table are normally not updated either. If, however, for some reason a measurement with an older timestamp is added, all rows for that device that come after that time are updated, but I don't think this happens very often.

The table 'stats_count' was created at a later stage, so the first 45M rows were added at once and chronologically. However, because the function to initialize this table took a long time and the client application crashed a few times, I had to restart this function several times. Can it be that there is some garbage left from running this function several times without finishing it? Would running 'VACUUM FULL' be a solution to reclaim some disk space?
Op 01-09-11 14:22, Scott Marlowe schreef:
> On Thu, Sep 1, 2011 at 6:05 AM, Rik Bellens<rik.bellens@telin.ugent.be> wrote:
>> The table 'stats_count' was created at a later stage [...] Can it be that
>> there is some trash left from running this function several times without
>> finishing it? Would it be a solution to run 'VACUUM FULL' to reclaim some
>> disk space?
>
> Yeah, could be. Take a look at this page:
> http://wiki.postgresql.org/wiki/Show_database_bloat and see if the
> query there sheds some light on your situ.

Thanks for this answer.

If I run the query, I get 12433752064 wasted bytes on stats_count_pkey, so I suppose that is the reason.
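Besides the wiki's bloat estimate, the raw per-index sizes can be read directly from the catalogs (a sketch using the standard size functions; the table name is from this thread):

```sql
-- Actual on-disk size of each index on stats_count:
SELECT indexrelid::regclass AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
  FROM pg_index
 WHERE indrelid = 'stats_count'::regclass;
```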
On Thu, Sep 1, 2011 at 6:38 AM, Rik Bellens <rik.bellens@telin.ugent.be> wrote:
> Op 01-09-11 14:22, Scott Marlowe schreef:
>> Yeah, could be. Take a look at this page:
>> http://wiki.postgresql.org/wiki/Show_database_bloat and see if the
>> query there sheds some light on your situ.
>
> thanks for this answer
>
> if i run the query, I get 12433752064 wasted bytes on stats_count_pkey, so I
> suppose that is the reason

Sounds like it. Take a look here:
http://wiki.postgresql.org/wiki/Index_Maintenance
On Thu, Sep 1, 2011 at 6:38 AM, Rik Bellens <rik.bellens@telin.ugent.be> wrote:
> if i run the query, I get 12433752064 wasted bytes on stats_count_pkey, so I
> suppose that is the reason

Also look into installing something like nagios and the check_postgres.pl plugin to keep track of these things before they get out of hand.

csb time: Back in the day, when pg 6.5.3 and 7.0 were new and interesting, I had a table that was 80k or so and an index that was about 100M. Back when dual-core machines were servers and 1G of RAM was an extravagance. I had a process that deleted everything from the table each night and replaced it, and the index was so huge that lookups were taking something like 10 seconds each. A simple drop / create index fixed it right up. The check_postgres.pl script is a godsend for keeping your db healthy and happy.
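The drop / create trick from the story translates to something like the sketch below, against the table names in this thread. REINDEX achieves the same result in one statement, but takes a stronger lock; the drop / create variant only works for a plain index, not one backing a primary key constraint:

```sql
-- Rebuild the bloated primary-key index in place
-- (locks out writes and index reads while it runs):
REINDEX INDEX stats_count_pkey;

-- For a non-constraint index, drop and recreate instead:
DROP INDEX measurement_javaobjectsubid_index;
CREATE INDEX measurement_javaobjectsubid_index
    ON measurement USING btree (javaobjectsubid);
```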
Rik Bellens wrote:
> I have two large tables in a database, one containing original data and
> the other one derived from the first table. [...] Nevertheless,
> the second table is much larger in disk size than the first one. How can
> this be explained?

Maybe the second table contains more empty space. See
http://www.postgresql.org/docs/9.0/static/routine-vacuuming.html#VACUUM-FOR-SPACE-RECOVERY

Yours,
Laurenz Albe
Op 01-09-11 14:43, Scott Marlowe schreef:
> Also look into installing something like nagios and the
> check_postgres.pl plugin to keep track of these things before they
> get out of hand.

After running REINDEX on stats_count_pkey, the disk size of this index was reduced by about 2 gigs, but the size of the table itself was still very large. Running 'VACUUM FULL' also reduced the table size from 14 gigs to about 2 gigs. I will now regularly check the database with the mentioned tools and queries.

Thank you for the very useful tips.
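For the record, the cleanup described above amounts to the following (a sketch; note that on 8.3, VACUUM FULL takes an exclusive lock on the table for its whole duration, so this is best run in a maintenance window):

```sql
REINDEX INDEX stats_count_pkey;  -- shrank the index by ~2 GB
VACUUM FULL stats_count;         -- shrank the heap from ~14 GB to ~2 GB
REINDEX TABLE stats_count;       -- VACUUM FULL can itself bloat indexes
```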
On Thu, Sep 1, 2011 at 10:04 AM, Rik Bellens <rik.bellens@telin.ugent.be> wrote:
> after running reindex on the stats_count_pkey, the disk size of this index
> was reduced by about 2 gigs, but the size of the table itself was still
> very large. running 'vacuum full' also reduced the table size from 14 gigs
> to about 2 gigs.

Excellent. Keep in mind that VACUUM FULL on a table can lead to index bloat, so you might need to reindex after that.