Thread: big database resulting in small dump
I have an 8.4 database (installed on Ubuntu 10.04 x86_64). It holds a Zabbix database. The database on disk takes 10 GB; an SQL dump takes only 2 GB. I've gone through http://archives.postgresql.org/pgsql-general/2008-08/msg00316.php and got some hints. Naturally, the biggest table is history (the second biggest is history_uint; together they make up about 95% of the total size). I tried to perform CLUSTER on it, but it seemed to be taking forever (3 hours and still not completed), so I cancelled it and went with a database drop and restore. That brought the database down from 10 GB to 6.4 GB. This is a good improvement, but still isn't quite what I expected. I would appreciate some clarification.
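For reference, a quick way to see which relations dominate the on-disk footprint is a query like the following (a sketch, assuming a psql session on the Zabbix database; the history tables should float to the top):

    -- List the ten largest tables, with indexes and TOAST included
    -- in the total (works on 8.4).
    SELECT c.relname,
           pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
    FROM pg_class c
    WHERE c.relkind = 'r'
    ORDER BY pg_total_relation_size(c.oid) DESC
    LIMIT 10;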
On Fri, Jul 20, 2012 at 11:05 AM, Ilya Ivanov <forn@ngs.ru> wrote:
> I have an 8.4 database (installed on Ubuntu 10.04 x86_64). It holds a
> Zabbix database. The database on disk takes 10 GB; an SQL dump takes
> only 2 GB. [...]

It's not entirely clear what behavior you expect here. Assuming you're referring to running pg_dump, you should just about never expect the size of the resulting dump to equal the amount of disk space the database server's files consume. For example, when I pg_dump a database that consumes about 290 GB of disk, the resulting dump is about 1.3 GB. This is normal and expected behavior.
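To put concrete numbers on that comparison, something like the following shows both sizes side by side (a sketch; the database name "zabbix" is an assumption):

    # On-disk size of the database as the server sees it:
    psql zabbix -c "SELECT pg_size_pretty(pg_database_size('zabbix'));"
    # Size of the dump, in bytes:
    pg_dump zabbix | wc -c        # plain-text dump, uncompressed
    pg_dump -Fc zabbix | wc -c    # custom format, compressed by default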
Lonni J Friedman <netllama@gmail.com> writes:
> It's not entirely clear what behavior you expect here. Assuming you're
> referring to running pg_dump, you should just about never expect the
> size of the resulting dump to equal the amount of disk space the
> database server's files consume. For example, when I pg_dump a database
> that consumes about 290 GB of disk, the resulting dump is about 1.3 GB.

The fine manual says someplace that databases are commonly about 5X the size of a plain-text dump, which is right in line with Ilya's results. Lonni's DB sounds a bit bloated :-(, though maybe he's got an atypically large set of indexes.

regards, tom lane
On Fri, Jul 20, 2012 at 11:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The fine manual says someplace that databases are commonly about 5X the
> size of a plain-text dump, which is right in line with Ilya's results.
> Lonni's DB sounds a bit bloated :-(, though maybe he's got an atypically
> large set of indexes.

I do have a lot of indexes. I'm also using a lot of partitions, so there is a relatively large number of tables.
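One way to see how much of a table's footprint is the heap versus everything else (a sketch; "history" stands in for any large table, and the subtraction is used because 8.4 predates pg_indexes_size()):

    SELECT pg_size_pretty(pg_relation_size('history')) AS heap,
           pg_size_pretty(pg_total_relation_size('history')
                          - pg_relation_size('history')) AS indexes_plus_toast;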
Well, it'd be good to have a link to the resource that mentions the ~5x ratio, but in general I'm satisfied with that explanation. Thank you.
Ilya Ivanov <forn@ngs.ru> writes:
> Well, it'd be good to have a link to the resource that mentions the ~5x
> ratio, but in general I'm satisfied with that explanation. Thank you.

[ digs around ... ] It's at the bottom of this page: http://www.postgresql.org/docs/9.1/static/install-requirements.html which I will grant is maybe not the best place for it anymore, since relatively few people do their own builds from source these days.

regards, tom lane
On 07/21/2012 02:05 AM, Ilya Ivanov wrote:
> I have an 8.4 database (installed on Ubuntu 10.04 x86_64). It holds a
> Zabbix database. The database on disk takes 10 GB; an SQL dump takes
> only 2 GB. [...] I would appreciate some clarification.
To elaborate on the answers already posted:
Plain-text dumps contain only the data itself. In many databases the table contents are a small part of the overall database size. Additionally, data is stored on disk in a structure optimised for speed of access, not for disk space consumption, so the same data can be much more compact in the dump format.
Finally, data in dumps is much, much more efficiently compressed than it is in the tables. In tables the main rows aren't compressed at all, and TOASTed values like big text fields are individually compressed, which is far less space-efficient than compressing them all together. On the other hand, per-value compression allows random access where the dump format just doesn't - kind of important for a database!
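To see that per-value compression in action, one can compare a value's logical size with its stored size (a toy sketch; the table and column names are hypothetical):

    SELECT octet_length(body)   AS logical_bytes,  -- bytes in the value itself
           pg_column_size(body) AS stored_bytes    -- bytes as stored (TOAST may compress)
    FROM big_text_table
    LIMIT 5;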
The rest of the space is used by:
- Indexes, which can get quite big.
- Free space in tables left behind by deleted rows that hasn't yet been reused for newly inserted rows. If you have a non-default FILLFACTOR there can be a lot of this.
- "Bloat" - wasted space in tables and indexes, typically caused by insufficiently frequent autovacuum runs (see the query sketch after this list).
- ... probably more I've forgotten.
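As a rough check for that kind of bloat, the standard statistics views show how many dead rows each table is carrying (a sketch, usable on 8.4):

    -- Tables with the most dead rows are the best vacuum candidates.
    SELECT relname, n_live_tup, n_dead_tup
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 10;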
When you dump and reload, you not only get rid of any bloat in your tables and indexes, you effectively REINDEX your database. PostgreSQL can often create a much more compact and efficient index structure when it runs CREATE INDEX on a full table (as happens when restoring a dump) than when it creates an index on an empty table and then fills it with lots of inserts.
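For completeness, much of the same cleanup can be approximated in place instead of via a full dump and reload (a sketch; the table name is from this thread, the index name is a guess, and all of these take exclusive locks, so schedule a maintenance window):

    -- On 8.4, VACUUM FULL compacts the heap but tends to bloat
    -- indexes, hence the REINDEX afterwards.
    VACUUM FULL history;
    REINDEX TABLE history;
    -- Or rewrite and reorder the table in one pass by an index
    -- ("history_pkey" is a hypothetical index name):
    -- CLUSTER history USING history_pkey;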
--
Craig Ringer