Thread: Why is my database so big?

Why is my database so big?

From: Andrew Smith

Hello,

I am setting up a proof of concept database to store some historical data.  Whilst I've used PostgreSQL a bit in the past this is the first time I've looked into disk usage due to the amount of data that could potentially be stored. I've done a quick test and I'm a little confused as to why it is occupying so much space on disk. Here is my table definition:

CREATE TABLE "TestSize"
(
  "Id" integer NOT NULL,
  "Time" timestamp without time zone NOT NULL,
  "Value" real NOT NULL,
  "Status" smallint NOT NULL,
  PRIMARY KEY ("Id", "Time")
);

CREATE INDEX test_index ON "TestSize" ("Id");
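
The load itself was roughly along these lines (a sketch; the exact values don't matter for the sizing question):

INSERT INTO "TestSize" ("Id", "Time", "Value", "Status")
SELECT i,
       timestamp '2016-01-01' + t * interval '1 minute',
       random()::real,
       0::smallint
FROM generate_series(1, 1000) AS i,   -- 1,000 ids
     generate_series(0, 999) AS t;    -- x 1,000 timestamps = 1 million rows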

With a completely empty table the database is 7 MB. After I insert 1 million records into the table the database is 121 MB. My understanding is that each of the fields is sized as follows:

integer - 4 bytes
timestamp without time zone - 8 bytes
real - 4 bytes
smallint - 2 bytes

So for 1 million records, it needs at least 18 million bytes, or ~17 MB to store the data. Now I'm sure there is extra space required for managing the primary key fields, the index and other misc overhead involved in getting this data into the internal storage format used by PostgreSQL. But even if I triple the number of bytes stored for each record, I only end up with 51 MB or so. Am I missing something obvious? 

Cheers,

Andrew

Re: Why is my database so big?

From: Venkata Balaji N


On Mon, Feb 22, 2016 at 3:20 PM, Andrew Smith <laconical@gmail.com> wrote:
Hello,

I am setting up a proof of concept database to store some historical data.  Whilst I've used PostgreSQL a bit in the past this is the first time I've looked into disk usage due to the amount of data that could potentially be stored. I've done a quick test and I'm a little confused as to why it is occupying so much space on disk. Here is my table definition:

CREATE TABLE "TestSize"
(
  "Id" integer NOT NULL,
  "Time" timestamp without time zone NOT NULL,
  "Value" real NOT NULL,
  "Status" smallint NOT NULL,
  PRIMARY KEY ("Id", "Time")
);

CREATE INDEX test_index ON "TestSize" ("Id");

With a completely empty table the database is 7 MB. After I insert 1 million records into the table the database is 121 MB. My understanding is that each of the fields is sized as follows:

integer - 4 bytes
timestamp without time zone - 8 bytes
real - 4 bytes
smallint - 2 bytes

So for 1 million records, it needs at least 18 million bytes, or ~17 MB to store the data. Now I'm sure there is extra space required for managing the primary key fields, the index and other misc overhead involved in getting this data into the internal storage format used by PostgreSQL. But even if I triple the number of bytes stored for each record, I only end up with 51 MB or so. Am I missing something obvious? 

Which PostgreSQL version, please?

What is the output of the query below?

=# select pg_size_pretty(pg_relation_size('"TestSize"'));

Do you see any pg_toast tables in the database? How are you calculating the database size?
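
For example, the following give the overall database size and the size of the table including its indexes and toast (note the double quotes, since the table was created with a mixed-case name):

=# select pg_size_pretty(pg_database_size(current_database()));
=# select pg_size_pretty(pg_total_relation_size('"TestSize"'));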

Each field's size can vary depending on the number of characters you entered. If you wish to understand the table's size statistics and its tuple sizes, the "pgstattuple" contrib module is the way to go.
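
For example (the contrib module has to be installed first):

=# create extension pgstattuple;
=# select * from pgstattuple('"TestSize"');
=# select * from pgstatindex('test_index');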

There are other components in PostgreSQL's data directory that occupy disk space: the server logs (if enabled), pg_xlog, etc.

What is the size of each directory within the data directory on disk?

Regards,
Venkata B N

Fujitsu Australia

Re: Why is my database so big?

From: Tom Lane

Andrew Smith <laconical@gmail.com> writes:
> I am setting up a proof of concept database to store some historical data.
> Whilst I've used PostgreSQL a bit in the past this is the first time I've
> looked into disk usage due to the amount of data that could potentially be
> stored. I've done a quick test and I'm a little confused as to why it is
> occupying so much space on disk. Here is my table definition:

> CREATE TABLE "TestSize"
> (
>   "Id" integer NOT NULL,
>   "Time" timestamp without time zone NOT NULL,
>   "Value" real NOT NULL,
>   "Status" smallint NOT NULL,
>   PRIMARY KEY ("Id", "Time")
> );

> CREATE INDEX test_index ON "TestSize" ("Id");

> With a completely empty table the database is 7 MB. After I insert 1
> million records into the table the database is 121 MB. My understanding is
> that each of the fields is sized as follows:

> integer - 4 bytes
> timestamp without time zone - 8 bytes
> real - 4 bytes
> smallint - 2 bytes

> So for 1 million records, it needs at least 18 million bytes, or ~17 MB to
> store the data. Now I'm sure there is extra space required for managing the
> primary key fields, the index and other misc overhead involved in getting
> this data into the internal storage format used by PostgreSQL. But even if
> I triple the number of bytes stored for each record, I only end up with 51
> MB or so. Am I missing something obvious?

It doesn't sound that far out of line.  Postgres is not great with narrow
tables like this one :-(.  The factors you're not accounting for include:

* Alignment padding.  On a 64-bit machine those fields would occupy 24
bytes, not 18, because row widths are always going to be multiples of 8.

* Row header overhead, which is 28 bytes per row (24-byte row header plus
4-byte row pointer).

* That primary key index is going to need 16 data bytes per entry
(alignment again), plus an 8-byte index tuple header, plus a 4-byte row
pointer.

* The other index similarly requires 8+8+4 bytes per row.

* Indexes tend not to be packed completely full.  If you load a PG btree
in exactly sequential order, the leaf pages should get packed to about
the index fillfactor (90% by default).  If you load in random order,
ancient wisdom is that the steady-state load factor for a btree is
about 66%.

* There are some other inefficiencies from upper-level btree pages, page
header overhead, the inability to split rows across pages, etc.; though
these tend not to amount to much unless you have wide rows or wide index
entries.

Given the above considerations, the *minimum* size of this table plus
indexes is 100 bytes/row.  Your observed result of 114 bytes/row
suggests you're getting an index load factor of around 80%, if I counted
on my fingers correctly.  That's not awful, but I'd guess that at least
one of the indexes is getting loaded nonsequentially.
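
You can check where the space is going yourself with something like this
(assuming the primary key index got the default name "TestSize_pkey"):

-- Heap: 24 data + 24 header + 4 item pointer = 52 bytes/row.
-- PK index: 16 + 8 + 4 = 28; test_index: 8 + 8 + 4 = 20; total minimum 100.
SELECT pg_size_pretty(pg_relation_size('"TestSize"'))      AS heap,
       pg_size_pretty(pg_relation_size('"TestSize_pkey"')) AS pkey_index,
       pg_size_pretty(pg_relation_size('test_index'))      AS id_index,
       pg_total_relation_size('"TestSize"') / 1000000      AS bytes_per_row;  -- over 1M rows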

You could REINDEX the indexes (possibly after playing with their
fillfactor settings) to improve their density.  But you're not going
to be able to move the needle by more than about 10% overall, so
personally I wouldn't bother.
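
For instance (again assuming the default primary key index name):

-- Only worth doing if the data is static or append-mostly.
ALTER INDEX test_index SET (fillfactor = 100);
REINDEX INDEX test_index;
REINDEX INDEX "TestSize_pkey";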

The long and the short of it is that Postgres is more oriented to
OLTP-style applications where access to and update of individual
rows is the key performance metric.  If you're dealing with historical
data and mainly want aggregated query results, it's possible you'd
get better performance and more-compact storage from a column-store
database.

There's ongoing investigation into extending Postgres to support
column-style storage for better support of applications like that;
but any such feature is probably several years away, and it will
not come without performance compromises of its own.

            regards, tom lane


Re: Why is my database so big?

From: FarjadFarid (ChkNet)

Tom, thanks for your unbiased detailed response.

Interesting post.

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Tom Lane
Sent: 22 February 2016 05:06
To: Andrew Smith
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Why is my database so big?

Andrew Smith <laconical@gmail.com> writes:
> I am setting up a proof of concept database to store some historical data.
> Whilst I've used PostgreSQL a bit in the past this is the first time
> I've looked into disk usage due to the amount of data that could
> potentially be stored. I've done a quick test and I'm a little
> confused as to why it is occupying so much space on disk. Here is my table
definition:

> CREATE TABLE "TestSize"
> (
>   "Id" integer NOT NULL,
>   "Time" timestamp without time zone NOT NULL,
>   "Value" real NOT NULL,
>   "Status" smallint NOT NULL,
>   PRIMARY KEY ("Id", "Time")
> );

> CREATE INDEX test_index ON "TestSize" ("Id");

> With a completely empty table the database is 7 MB. After I insert 1
> million records into the table the database is 121 MB. My
> understanding is that each of the fields is sized as follows:

> integer - 4 bytes
> timestamp without time zone - 8 bytes
> real - 4 bytes
> smallint - 2 bytes

> So for 1 million records, it needs at least 18 million bytes, or ~17
> MB to store the data. Now I'm sure there is extra space required for
> managing the primary key fields, the index and other misc overhead
> involved in getting this data into the internal storage format used by
> PostgreSQL. But even if I triple the number of bytes stored for each
> record, I only end up with 51 MB or so. Am I missing something obvious?

It doesn't sound that far out of line.  Postgres is not great with narrow
tables like this one :-(.  The factors you're not accounting for include:

* Alignment padding.  On a 64-bit machine those fields would occupy 24
bytes, not 18, because row widths are always going to be multiples of 8.

* Row header overhead, which is 28 bytes per row (24-byte row header plus
4-byte row pointer).

* That primary key index is going to need 16 data bytes per entry (alignment
again), plus an 8-byte index tuple header, plus a 4-byte row pointer.

* The other index similarly requires 8+8+4 bytes per row.

* Indexes tend not to be packed completely full.  If you load a PG btree in
exactly sequential order, the leaf pages should get packed to about the
index fillfactor (90% by default).  If you load in random order, ancient
wisdom is that the steady-state load factor for a btree is about 66%.

* There are some other inefficiencies from upper-level btree pages, page
header overhead, the inability to split rows across pages, etc.; though
these tend not to amount to much unless you have wide rows or wide index
entries.

Given the above considerations, the *minimum* size of this table plus
indexes is 100 bytes/row.  Your observed result of 114 bytes/row suggests
you're getting an index load factor of around 80%, if I counted on my
fingers correctly.  That's not awful, but I'd guess that at least one of the
indexes is getting loaded nonsequentially.

You could REINDEX the indexes (possibly after playing with their fillfactor
settings) to improve their density.  But you're not going to be able to move
the needle by more than about 10% overall, so personally I wouldn't bother.

The long and the short of it is that Postgres is more oriented to OLTP-style
applications where access to and update of individual rows is the key
performance metric.  If you're dealing with historical data and mainly want
aggregated query results, it's possible you'd get better performance and
more-compact storage from a column-store database.

There's ongoing investigation into extending Postgres to support
column-style storage for better support of applications like that; but any
such feature is probably several years away, and it will not come without
performance compromises of its own.

            regards, tom lane


Re: Why is my database so big?

From: Stephen Frost

All,

* FarjadFarid(ChkNet) (farjad.farid@checknetworks.com) wrote:
> Tom, thanks for your unbiased detailed response.
>
> Interesting post.

Please don't top-post.  My comments are in-line, below.

> -----Original Message-----
> From: pgsql-general-owner@postgresql.org
> [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Tom Lane
> Sent: 22 February 2016 05:06
> To: Andrew Smith
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Why is my database so big?
>
> Andrew Smith <laconical@gmail.com> writes:
> > I am setting up a proof of concept database to store some historical data.
> > Whilst I've used PostgreSQL a bit in the past this is the first time
> > I've looked into disk usage due to the amount of data that could
> > potentially be stored. I've done a quick test and I'm a little
> > confused as to why it is occupying so much space on disk. Here is my table
> definition:
>
> > CREATE TABLE "TestSize"
> > (
> >   "Id" integer NOT NULL,
> >   "Time" timestamp without time zone NOT NULL,
> >   "Value" real NOT NULL,
> >   "Status" smallint NOT NULL,
> >   PRIMARY KEY ("Id", "Time")
> > );
>
> > CREATE INDEX test_index ON "TestSize" ("Id");

Note that you don't really need an index on "Id": declaring a primary
key automatically creates an index on those fields, and an index over
("Id", "Time") can be used to satisfy queries that have a conditional
on just the "Id" column.  Removing that extra index will likely help
with the space issues.
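
In other words:

DROP INDEX test_index;
-- Queries filtering on just "Id" can still use the primary key index on ("Id", "Time"):
EXPLAIN SELECT * FROM "TestSize" WHERE "Id" = 42;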

> > With a completely empty table the database is 7 MB. After I insert 1
> > million records into the table the database is 121 MB. My
> > understanding is that each of the fields is sized as follows:
>
> > integer - 4 bytes
> > timestamp without time zone - 8 bytes
> > real - 4 bytes
> > smallint - 2 bytes

I'd recommend against using timestamp w/o time zone.  For starters, as
noted, it's not actually saving you any space over timestamp w/ time
zone, and second, it makes working with that field painful and prone to
error.
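
For the existing test table that would be something like this (note it
rewrites the table; without a USING clause the stored values are
interpreted in the session's current time zone):

ALTER TABLE "TestSize"
  ALTER COLUMN "Time" TYPE timestamp with time zone;
-- If the stored values are actually UTC, add: USING "Time" AT TIME ZONE 'UTC'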

> The long and the short of it is that Postgres is more oriented to OLTP-style
> applications where access to and update of individual rows is the key
> performance metric.  If you're dealing with historical data and mainly want
> aggregated query results, it's possible you'd get better performance and
> more-compact storage from a column-store database.
>
> There's ongoing investigation into extending Postgres to support
> column-style storage for better support of applications like that; but any
> such feature is probably several years away, and it will not come without
> performance compromises of its own.

One approach to dealing with the PG per-row overhead is to use arrays
instead of rows, when you don't need that per-row visibility
information.  I've found that to be very successful for single-column
tables, but the technique may work reasonably well for other structures
also.
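
As a rough sketch of what that could look like here (table and column
names are just for illustration):

-- One row per "Id" per hour; sample times are implied by array position
-- (say one element per minute), so the 28-byte row overhead is paid once
-- per 60 samples instead of once per sample.
CREATE TABLE "TestSizePacked"
(
  "Id"        integer NOT NULL,
  "HourStart" timestamp with time zone NOT NULL,
  "Values"    real[] NOT NULL,
  "Statuses"  smallint[] NOT NULL,
  PRIMARY KEY ("Id", "HourStart")
);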

Thanks!

Stephen
