Thread: Tuple storage overhead

Tuple storage overhead

From: Peter Bex
Date:
Hi all,

I have a table with three columns: one integer and two doubles.
There are two indexes defined (one on the integer and one on one
of the doubles).  This table stores 700000 records, which take up
30 Mb according to pg_relation_size(), and the total relation size
is 66 Mb.

I expected the disk space usage to be halved by changing the doubles
to floats, but it only dropped by 5 MB!  (I tried various approaches,
including dumping and restoring to make sure there was no uncollected
garbage lying around)
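
A minimal sketch of such a type change (table and column names here are
hypothetical; in PostgreSQL the 4-byte float type is "real", and
ALTER ... TYPE rewrites the table, so no separate dump/restore should be
needed just for that step):

  ALTER TABLE samples
      ALTER COLUMN val_a TYPE real,
      ALTER COLUMN val_b TYPE real;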

Someone on IRC told me the per-tuple storage overhead is pretty big,
and asked me to create a similar table containing only integers:

db=# create table testing (  x integer );
db=# INSERT INTO testing (x) VALUES (generate_series(1, 700000));
dacolt_development=# SELECT pg_size_pretty(pg_total_relation_size('testing'));
 pg_size_pretty
 ----------------
  24 MB
  (1 row)
db=# SELECT pg_size_pretty(pg_relation_size('testing'));
 pg_size_pretty
 ----------------
  24 MB
db=# CREATE INDEX testing_1 ON testing (x);
db=# CREATE INDEX testing_2 ON testing (x);
db=# SELECT pg_size_pretty(pg_relation_size('testing'));
 pg_size_pretty
 ----------------
  24 MB
  (1 row)
db=# SELECT pg_size_pretty(pg_total_relation_size('testing'));
 pg_size_pretty
 ----------------
  54 MB
  (1 row)
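
A quick empirical check of that overhead, run against the table above
(integer division, so the result is rounded down):

  SELECT pg_relation_size('testing') / 700000 AS bytes_per_row;
  -- roughly 36 on a typical 64-bit build; most of that is the
  -- per-row header rather than the 4-byte integer itself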

Is there a way to reduce the per-tuple storage overhead?

The reason I'm asking is that I have tons of tables like this,
and some data sets are much bigger than this.  In a relatively
simple testcase I'm importing data from text files which are
5.7 Gb in total, and this causes the db size to grow to 34Gb.

This size is just one small sample of many such datasets that I
need to import, so disk size is really an important factor.

Regards,
Peter
--
http://sjamaan.ath.cx
--
"The process of preparing programs for a digital computer
 is especially attractive, not only because it can be economically
 and scientifically rewarding, but also because it can be an aesthetic
 experience much like composing poetry or music."
                            -- Donald Knuth

Re: Tuple storage overhead

From: Szymon Guz
Date:


2010/4/16 Peter Bex <Peter.Bex@xs4all.nl>
> Hi all,
>
> I have a table with three columns: one integer and two doubles.
> There are two indexes defined (one on the integer and one on one
> of the doubles).  This table stores 700000 records, which take up
> 30 Mb according to pg_relation_size(), and the total relation size
> is 66 Mb.
[snip]
> Is there a way to reduce the per-tuple storage overhead?
>
> The reason I'm asking is that I have tons of tables like this,
> and some data sets are much bigger than this.  In a relatively
> simple testcase I'm importing data from text files which are
> 5.7 Gb in total, and this causes the db size to grow to 34Gb.
>
> This size is just one small sample of many such datasets that I
> need to import, so disk size is really an important factor.


File pages are not fully filled from the start as that could result in bad performance of queries later. If you want those pages fully filled, you can set a table storage parameter such as fillfactor. But be aware that many queries can become much slower, and changing those parameters usually isn't a good idea. If you will never do any updates on those tables, though, it could work.



regards
Szymon Guz

Re: Tuple storage overhead

From: Peter Bex
Date:
On Fri, Apr 16, 2010 at 11:59:38AM +0200, Szymon Guz wrote:
> File pages are not fully filled from the start as that could result in bad
> performance of queries later.

The manual page you linked to says something else:
"The fillfactor for a table is a percentage between 10 and 100.
 100 (complete packing) is the default."

However, the index has a default fill factor of 90, so I guess
I can tweak that to 100 to shave off another few percent.
(there will be no updates nor extra inserts on these tables)
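
Something along these lines should do it (a sketch against the "testing"
table from earlier; the new fillfactor only takes effect when the index
is rebuilt or newly created):

  ALTER INDEX testing_1 SET (fillfactor = 100);
  REINDEX INDEX testing_1;

  -- or specify it when creating an index, e.g.:
  -- CREATE INDEX testing_3 ON testing (x) WITH (fillfactor = 100);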

Thanks for the tip!  I hope there are more ways to tweak it, though,
because this doesn't save that much.

Regards,
Peter
--
http://sjamaan.ath.cx
--
"The process of preparing programs for a digital computer
 is especially attractive, not only because it can be economically
 and scientifically rewarding, but also because it can be an aesthetic
 experience much like composing poetry or music."
                            -- Donald Knuth

Re: Tuple storage overhead

From: Szymon Guz
Date:


2010/4/16 Peter Bex <Peter.Bex@xs4all.nl>
> On Fri, Apr 16, 2010 at 11:59:38AM +0200, Szymon Guz wrote:
> > File pages are not fully filled from the start as that could result in bad
> > performance of queries later.
>
> The manual page you linked to says something else:
> "The fillfactor for a table is a percentage between 10 and 100.
>  100 (complete packing) is the default."
>
> However, the index has a default fill factor of 90, so I guess
> I can tweak that to 100 to shave off another few percent.
> (there will be no updates nor extra inserts on these tables)
>
> Thanks for the tip!  I hope there are more ways to tweak it, though,
> because this doesn't save that much.


I thought that the default fillfactor was much smaller (and hadn't checked it until now)...  sorry for messing that up.
But let's look at it from the other side: what do you want to do with that data?  Maybe PostgreSQL, with its MVCC overhead, isn't the best solution for your needs.

regards
Szymon Guz

Re: Tuple storage overhead

From: Richard Huxton
Date:
On 16/04/10 10:41, Peter Bex wrote:
> Hi all,
>
> I have a table with three columns: one integer and two doubles.
> There are two indexes defined (one on the integer and one on one
> of the doubles).  This table stores 700000 records, which take up
> 30 Mb according to pg_relation_size(), and the total relation size
> is 66 Mb.
[snip]
> Is there a way to reduce the per-tuple storage overhead?

Short answer - no.

The database has to track visibility of every row - when it was
inserted, deleted etc to support the MVCC concurrency system.

http://www.postgresql.org/docs/8.4/static/storage-page-layout.html

That means 24 bytes of overhead (on most systems) for each row. That's
higher than in some other RDBMSes, but they all have some overhead.
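
As a rough cross-check against the numbers in this thread (assuming the
usual 8 kB page size, 8-byte alignment and about 8168 usable bytes per
page after the page header; exact figures vary by platform and version):

  -- "testing" table: 4-byte line pointer + 24-byte tuple header
  -- + one 4-byte integer padded to 8 bytes = 36 bytes per row
  SELECT pg_size_pretty(
           (ceil(700000 / floor(8168 / 36.0)) * 8192)::bigint
         ) AS estimated_heap_size;
  -- about 24 MB, which matches pg_relation_size('testing') above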

> The reason I'm asking is that I have tons of tables like this,
> and some data sets are much bigger than this.  In a relatively
> simple testcase I'm importing data from text files which are
> 5.7 Gb in total, and this causes the db size to grow to 34Gb.

Anything from double to ten times the size isn't unexpected, depending
on row-sizes and how many indexes you are talking about.

--
   Richard Huxton
   Archonet Ltd

Re: Tuple storage overhead

From: Tom Lane
Date:
Richard Huxton <dev@archonet.com> writes:
> On 16/04/10 10:41, Peter Bex wrote:
>> Is there a way to reduce the per-tuple storage overhead?

> Short answer - no.

About the only thing you could really do is rethink the table layout.
If you can put more data per row, then the fractional overhead for
tuple headers naturally will be less.  Whether this results in a
convenient-to-use table is hard to guess without knowing your
application ...
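
To put rough numbers on that, taking about 28 bytes of fixed per-row cost
(the 24-byte header plus a 4-byte line pointer), the overhead fraction
drops quickly as the payload grows (payload sizes below are illustrative
only):

  SELECT payload,
         round(28.0 / (28 + payload) * 100, 1) AS overhead_pct
    FROM (VALUES (20), (100), (400)) AS t(payload);
  -- 20 bytes -> ~58%, 100 bytes -> ~22%, 400 bytes -> ~6.5%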

            regards, tom lane

Re: Tuple storage overhead

From: Merlin Moncure
Date:
On Fri, Apr 16, 2010 at 5:41 AM, Peter Bex <Peter.Bex@xs4all.nl> wrote:
> Hi all,
>
> I have a table with three columns: one integer and two doubles.
> There are two indexes defined (one on the integer and one on one
> of the doubles).  This table stores 700000 records, which take up
> 30 Mb according to pg_relation_size(), and the total relation size
> is 66 Mb.
>
> I expected the disk space usage to be halved by changing the doubles
> to floats, but it only dropped by 5 MB!  (I tried various approaches,
> including dumping and restoring to make sure there was no uncollected
> garbage lying around)
>
> Someone on IRC told me the per-tuple storage overhead is pretty big,
> and asked me to create a similar table containing only integers:
>
> db=# create table testing (  x integer );
> db=# INSERT INTO testing (x) VALUES (generate_series(1, 700000));
> dacolt_development=# SELECT pg_size_pretty(pg_total_relation_size('testing'));
>  pg_size_pretty
>  ----------------
>  24 MB
>  (1 row)
> db=# SELECT pg_size_pretty(pg_relation_size('testing'));
>  pg_size_pretty
>  ----------------
>  24 MB
> db=# CREATE INDEX testing_1 ON testing (x);
> db=# CREATE INDEX testing_2 ON testing (x);
> db=# SELECT pg_size_pretty(pg_relation_size('testing'));
>  pg_size_pretty
>  ----------------
>  24 MB
>  (1 row)
> db=# SELECT pg_size_pretty(pg_total_relation_size('testing'));
>  pg_size_pretty
>  ----------------
>  54 MB
>  (1 row)
>
> Is there a way to reduce the per-tuple storage overhead?
>
> The reason I'm asking is that I have tons of tables like this,
> and some data sets are much bigger than this.  In a relatively
> simple testcase I'm importing data from text files which are
> 5.7 Gb in total, and this causes the db size to grow to 34Gb.
>
> This size is just one small sample of many such datasets that I
> need to import, so disk size is really an important factor.

If you are storing big data and want to keep the overhead low, the
first thing you need to examine is organizing your data into arrays.
This involves tradeoffs, of course, and may not work, but it's worth a
shot!
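
A minimal sketch of what that could look like for per-timestep data (all
table and column names here are hypothetical, not from the original
schema):

  -- one row per run and output variable instead of one row per timestep,
  -- trading per-row header overhead for array storage
  CREATE TABLE run_series (
      run_id    integer,
      variable  text,
      timesteps float8[],
      readings  float8[]
  );

  -- individual points can be pulled out by ordinary subscripting:
  SELECT timesteps[42], readings[42]
    FROM run_series
   WHERE run_id = 1 AND variable = 'output_17';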

merlin

Re: Tuple storage overhead

From: Peter Bex
Date:
On Fri, Apr 16, 2010 at 12:40:21PM +0200, Szymon Guz wrote:
> I thought that the default fillfactor was much smaller (and haven't checked
> that now)...  sorry for messing that up.
> But let's think of it from the other side: what do you want to do with that
> data? Maybe PostgreSQL with it's MVCC's overhead isn't the best solution for
> your needs.

I'm using this as part of a larger application. The data sets are one aspect
of it.  The idea is the following:

An engineering application generates time-dependent data. One simulation
yields a very big text file in tabular format, with hundreds or thousands
of columns with output values (often more than Postgres' column limit),
one row per timestamp.

One such file is generated for each permutation of input values which
influence the run of a simulation.

This text file is imported into a database so we can perform very quick
lookups on the numbers and plot them quickly in a graph.
The user can select any number of input permutations and graph the
values of any selected output values to view the effect of the variation
in input.

One can plot any variable against any other, so one join is made for
each variable that we want to plot; it joins the timestep values of the
variable on the X axis to those on the Y axis.
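
In rough terms, each extra plotted variable adds a self-join along these
lines (layout and names are hypothetical):

  -- X and Y series for one run, matched on the shared timestep
  SELECT x.value AS x_value, y.value AS y_value
    FROM sim_output AS x
    JOIN sim_output AS y USING (run_id, timestep)
   WHERE x.variable = 'var_on_x_axis'
     AND y.variable = 'var_on_y_axis'
     AND run_id = 123;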

Regards,
Peter
--
http://sjamaan.ath.cx
--
"The process of preparing programs for a digital computer
 is especially attractive, not only because it can be economically
 and scientifically rewarding, but also because it can be an aesthetic
 experience much like composing poetry or music."
                            -- Donald Knuth

Re: Tuple storage overhead

From: Peter Bex
Date:
On Fri, Apr 16, 2010 at 11:28:36AM -0400, Merlin Moncure wrote:
> If you are storing big data and want to keep the overhead low, the
> first thing you need to examine is organizing your data into arrays.
> This involves tradeoffs of course and may not work but it's worth a
> shot!

That does sound interesting.  However, I'm storing several hundred or
thousand data points (depending on the data set).  What is PostgreSQL's
overhead when it comes to extracting one or two items from an array in
a query?

Cheers,
Peter
--
http://sjamaan.ath.cx
--
"The process of preparing programs for a digital computer
 is especially attractive, not only because it can be economically
 and scientifically rewarding, but also because it can be an aesthetic
 experience much like composing poetry or music."
                            -- Donald Knuth