Thread: Estimating Database Disk Space

Estimating Database Disk Space

From: Richard A Lough
Hi All,

I have attached my attempt at documenting the use of
hard disk storage by a postgres database. I'll point out
that I am still on the learning curve for postgres, and
my assumptions may well be completely wrong.

The main point I think that needs validation is whether
it is best to create a model of a database in order to get
an accurate estimate of the space needed.

TIA

Richard A Lough

Chapter 10. Disk Storage

Estimating Disk Storage requirements for a database.

The following advice is provided for a machine running Linux and PostgreSQL 7.1. The filesystem in use is ext2.

There is no easy way to estimate precisely the storage requirements for a given database. If an accurate size estimate
is needed, it is suggested that a dummy database be set up with suitable fields, numbers of tables, and queries. It
should be filled with suitable random data and operated as a database for a short time before attempting to measure the
disk storage needed to store and manage all the information.

Even this method provides only an estimate, and the figures should be carefully reviewed.
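
As a minimal sketch of such a model (the table name and values here are illustrative only, chosen to match the FAQ figures used below):

create table sizing_test (t char(20), i integer);
insert into sizing_test values ('12345678901234567890', 1);
-- each repetition of this statement doubles the row count;
-- repeat until the table reaches a representative size
insert into sizing_test select t, i from sizing_test;

Substituting random() values for the constants would give more realistic data, but for pure size estimation the content matters less than the field widths.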

Some general guidelines can be provided to assist the assessment. The following figures are taken from the FAQ. They
suppose the database contains one table of 100000 rows, each with 20 characters and one integer. Including end-of-line
bytes, a text file containing the same information would be 2.8 megabytes (MB).

The User's Guide, Chapter 3 "Data Types", provides information on the storage requirements for each piece of information. We
find that the numeric type integer is held in four bytes. With a 20 byte text field, the information in each row is held
in 24 bytes. Each row has a header of 36 bytes (approximately) and a 4 byte pointer on the page. The total storage
required for each row of data is 64 bytes.
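
Laid out as a sum:

  20 bytes  text field
+  4 bytes  integer
+ 36 bytes  row header (approximate)
+  4 bytes  pointer on the page
----------
  64 bytes  per row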

Each data page in postgres is 8192 bytes (8K) in size, and each page can hold up to 128 of our rows. Our row size of 64 bytes
allows the page to be completely filled. If, for example, the row size were 129 bytes, we could fit only 63 rows into the
page, leaving 65 bytes unused. Similarly, our 100000 rows will try to fit into 781.25 pages, with the 782nd
page being three-quarters empty.
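
To compare an estimate like this against reality, the page count for a table can be read from the system catalogs. Assuming the sizing_test table from the sketch above, and a vacuum first so the statistics are current:

vacuum analyze sizing_test;
select relname, relpages, reltuples from pg_class where relname = 'sizing_test';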

As another example, our 64 byte row could instead hold a date, an integer, and four real floating point numbers, together
with the same header and pointer. The User's Guide reveals that each of these fields takes four bytes to store.

Our calculation tells us that we will use 782 pages of 8K each to store the data. That is about 6.4 MB. This is not the
whole story. One table of 100000 rows is relatively inexpensive for postgres to manage.
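
Spelled out:

100000 rows / 128 rows per page = 781.25, rounded up to 782 pages
  782 pages x 8192 bytes        = 6406144 bytes, about 6.4 MB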

We might reasonably expect that a database of 10000 tables, each of 10 rows using 64 bytes, would also need 6.4 MB for
storage. For comparison, we will assume that our second database has rows containing a date, an integer, and four real
numbers. We should suspect, however, that postgres will need more resources to manage a database with a large number of
tables. If there are hidden overheads in managing tables, our database of 10000 tables should reveal them.
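
One rough way to see that overhead directly is to total the pages used by every relation postgres tracks, system catalogs included. This is a check, not an exact accounting:

select count(*) as relations, sum(relpages) as pages,
       sum(relpages) * 8192 as bytes
from pg_class;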

To discover these hidden resources on disk we will need some tools to help us find what we are looking for. If our database is
already in operation, disk usage (man du) is what we want to know. Try

$ du -s * | sort -r -n | head -10

in $PGDATA.

That should give you the ten largest users of disk space in the postgres data directory. Further examination will
reveal a directory for each database. Unfortunately, they are all named with numbers that match the oid of the database
instead of the database name. There is a program in contrib (oid2name) that takes these numbers and gives you the database name.
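
The program must be built from the contrib directory first; roughly:

$ cd contrib/oid2name
$ make && make install
$ oid2name

Run with no arguments against a local server, oid2name should list each database alongside its oid; consult its README for the connection options.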

The xlog segment can take up a lot of space. Each incremental increase in size is usually set at 16 MB. We can expect
to be fortunate with our first database of one table, consuming "only" 16 MB of xlog for a total of about 22 MB. We can
expect our database of 10000 tables to range in size from 22 MB to 38 MB, and consider ourselves unfortunate if disk
usage rises to 54 MB. That, however, is not the complete picture.
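
A quick check of the current log size, assuming the default data directory layout:

$ du -sh $PGDATA/pg_xlog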

We begin our next search by logging in to template1 and asking for the list of databases:

template1=> \l

This should show all the databases in the system, and the list should include template0, template1, lo_demo, and any
other databases created on the system and still active. We are looking for those bits of information (parts of indexes)
which remain behind even after

$ vacuumdb database

is run on a database to clean "dead" information out of it.

Creating and deleting tables within a database is one activity known to cause this problem. To see the extent of the
problem, access each database and search for relpages, like so:

select relname, relpages from pg_class order by relpages desc limit 10;

This should give the top ten offenders within each database. Each page counted in relpages is 8K in size.

You need to do a lot of creating and deleting of tables to make this problem noticeable. If your database seems to be
growing exponentially and you are adding and deleting tables very frequently, this may be your problem. The only way at
present (March 2002) to fix the problem is to dump, delete, create, and restore the database.
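
A sketch of that cycle, assuming a database named mydb and enough free disk space to hold the dump:

$ pg_dump mydb > mydb.sql
$ dropdb mydb
$ createdb mydb
$ psql mydb < mydb.sql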

It is reasonable to omit calculations of this wastage from estimates of the disk space which will be used by the
database. If a problem with index bloat is suspected, regular checks and measures to control the problem should be put in
place.

Summary

The crucial question to ask when preparing any estimate is: "How accurate does this need to be?" If your estimate
needs to be closer than +100% / -50%, you will need some form of historical information to improve the accuracy of your
estimate. The most cost effective way of doing this is by generating a model of the database, operating it for a short
time, and measuring the storage needed.

Re: Estimating Database Disk Space

From: Peter Eisentraut
Richard A Lough writes:

> The main point I think that needs validation is whether
> it is best to create a model of a database in order to get
> an accurate estimate of the space needed.

That's pretty much it.  If you want to know how much space something is
going to take, create it and look at it.  I don't think we want to get
users involved in counting the bits of each piece of overhead that comes
with a database.  That creates assumptions that are going to be hard to
overcome or change in the future.

--
Peter Eisentraut   peter_e@gmx.net