Thread: Tuple storage overhead
Hi all,

I have a table with three columns: one integer and two doubles.
There are two indexes defined (one on the integer and one on one
of the doubles). This table stores 700000 records, which take up
30 MB according to pg_relation_size(), and the total relation size
is 66 MB.

I expected the disk space usage to be halved by changing the doubles
to floats, but it only dropped by 5 MB! (I tried various approaches,
including dumping and restoring to make sure there was no uncollected
garbage lying around)

Someone on IRC told me the per-tuple storage overhead is pretty big,
and asked me to create a similar table containing only integers:

db=# create table testing ( x integer );
db=# INSERT INTO testing (x) VALUES (generate_series(1, 700000));
dacolt_development=# SELECT pg_size_pretty(pg_total_relation_size('testing'));
pg_size_pretty
----------------
24 MB
(1 row)
db=# SELECT pg_size_pretty(pg_relation_size('testing'));
pg_size_pretty
----------------
24 MB
db=# CREATE INDEX testing_1 ON testing (x);
db=# CREATE INDEX testing_2 ON testing (x);
db=# SELECT pg_size_pretty(pg_relation_size('testing'));
pg_size_pretty
----------------
24 MB
(1 row)
db=# SELECT pg_size_pretty(pg_total_relation_size('testing'));
pg_size_pretty
----------------
54 MB
(1 row)

Is there a way to reduce the per-tuple storage overhead?

The reason I'm asking is that I have tons of tables like this,
and some data sets are much bigger than this. In a relatively
simple testcase I'm importing data from text files which are
5.7 GB in total, and this causes the db size to grow to 34 GB.

This size is just one small sample of many such datasets that I
need to import, so disk size is really an important factor.

Regards,
Peter
--
http://sjamaan.ath.cx
--
"The process of preparing programs for a digital computer is especially
attractive, not only because it can be economically and scientifically
rewarding, but also because it can be an aesthetic experience much like
composing poetry or music."
						-- Donald Knuth
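[For reference, one way to see where the space goes is to divide the heap
size by the row count and compare that with the size of the column data
alone. This is only a sketch against a hypothetical table
samples (id integer, val_a double precision, val_b double precision)
shaped like the one described above; pg_relation_size(),
pg_total_relation_size(), pg_size_pretty() and pg_column_size() are the
stock size functions.

-- Average on-disk footprint per row vs. the bare column payload
-- (the table and column names here are assumptions, not from the thread).
SELECT count(*)                                          AS rows,
       pg_size_pretty(pg_relation_size('samples'))       AS heap_size,
       pg_size_pretty(pg_total_relation_size('samples')) AS heap_plus_indexes,
       pg_relation_size('samples') / count(*)            AS bytes_per_row_on_disk,
       avg(pg_column_size(id) + pg_column_size(val_a)
           + pg_column_size(val_b))                      AS payload_bytes_per_row
FROM samples;

The gap between the last two numbers roughly corresponds to the per-tuple
and per-page bookkeeping discussed in the replies below.]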
2010/4/16 Peter Bex <Peter.Bex@xs4all.nl>
> Hi all,
>
> I have a table with three columns: one integer and two doubles.
> There are two indexes defined (one on the integer and one on one
> of the doubles). This table stores 700000 records, which take up
> 30 MB according to pg_relation_size(), and the total relation size
> is 66 MB.
>
> I expected the disk space usage to be halved by changing the doubles
> to floats, but it only dropped by 5 MB! (I tried various approaches,
> including dumping and restoring to make sure there was no uncollected
> garbage lying around)
>
> Someone on IRC told me the per-tuple storage overhead is pretty big,
> and asked me to create a similar table containing only integers:
>
> db=# create table testing ( x integer );
> db=# INSERT INTO testing (x) VALUES (generate_series(1, 700000));
> dacolt_development=# SELECT pg_size_pretty(pg_total_relation_size('testing'));
> pg_size_pretty
> ----------------
> 24 MB
> (1 row)
> db=# SELECT pg_size_pretty(pg_relation_size('testing'));
> pg_size_pretty
> ----------------
> 24 MB
> db=# CREATE INDEX testing_1 ON testing (x);
> db=# CREATE INDEX testing_2 ON testing (x);
> db=# SELECT pg_size_pretty(pg_relation_size('testing'));
> pg_size_pretty
> ----------------
> 24 MB
> (1 row)
> db=# SELECT pg_size_pretty(pg_total_relation_size('testing'));
> pg_size_pretty
> ----------------
> 54 MB
> (1 row)
>
> Is there a way to reduce the per-tuple storage overhead?
>
> The reason I'm asking is that I have tons of tables like this,
> and some data sets are much bigger than this. In a relatively
> simple testcase I'm importing data from text files which are
> 5.7 GB in total, and this causes the db size to grow to 34 GB.
>
> This size is just one small sample of many such datasets that I
> need to import, so disk size is really an important factor.
File pages are not fully filled from the start, as that could result in bad
performance of queries later. If you want those pages fully filled, you can
set a table storage parameter such as fillfactor. But be aware that many
queries can become much slower; changing those parameters usually isn't a
good idea. If you will never update those tables, though, it could work.
regards
Szymon Guz
On Fri, Apr 16, 2010 at 11:59:38AM +0200, Szymon Guz wrote:
> File pages are not fully filled from the start, as that could result in bad
> performance of queries later.

The manual page you linked to says something else:

"The fillfactor for a table is a percentage between 10 and 100.
100 (complete packing) is the default."

However, the index has a default fill factor of 90, so I guess
I can tweak that to 100 to shave off another few percent.
(there will be no updates nor extra inserts on these tables)

Thanks for the tip! I hope there are more ways to tweak it, though,
because this doesn't save that much.

Regards,
Peter
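[For reference, a minimal sketch of how the fillfactor storage parameters
are spelled (PostgreSQL 8.2 or later syntax; the table and index names are
made up). Since the table default is already 100, only the index setting is
likely to buy anything for a load-once, never-updated table.

-- B-tree indexes default to fillfactor 90; pack the leaf pages completely
-- if the table will never be updated after loading (hypothetical names):
CREATE INDEX samples_id_idx ON samples (id) WITH (fillfactor = 100);

-- An existing index can be changed and rebuilt instead:
ALTER INDEX samples_id_idx SET (fillfactor = 100);
REINDEX INDEX samples_id_idx;

-- The table-level parameter also exists, but its default is already 100:
ALTER TABLE samples SET (fillfactor = 100);]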
2010/4/16 Peter Bex <Peter.Bex@xs4all.nl>
> On Fri, Apr 16, 2010 at 11:59:38AM +0200, Szymon Guz wrote:
> > File pages are not fully filled from the start, as that could result in bad
> > performance of queries later.
>
> The manual page you linked to says something else:
>
> "The fillfactor for a table is a percentage between 10 and 100.
> 100 (complete packing) is the default."
>
> However, the index has a default fill factor of 90, so I guess
> I can tweak that to 100 to shave off another few percent.
> (there will be no updates nor extra inserts on these tables)
>
> Thanks for the tip! I hope there are more ways to tweak it, though,
> because this doesn't save that much.
I thought the default fillfactor was much smaller (and hadn't checked it
until now)... sorry for messing that up.

But let's look at it from the other side: what do you want to do with that
data? Maybe PostgreSQL, with its MVCC overhead, isn't the best solution for
your needs.
regards
Szymon Guz
On 16/04/10 10:41, Peter Bex wrote:
> Hi all,
>
> I have a table with three columns: one integer and two doubles.
> There are two indexes defined (one on the integer and one on one
> of the doubles). This table stores 700000 records, which take up
> 30 MB according to pg_relation_size(), and the total relation size
> is 66 MB.
[snip]
> Is there a way to reduce the per-tuple storage overhead?

Short answer - no. The database has to track the visibility of every row -
when it was inserted, deleted, etc. - to support the MVCC concurrency system.

http://www.postgresql.org/docs/8.4/static/storage-page-layout.html

That means 24 bytes of overhead (on most systems) for each row. That's
higher than in some other RDBMSes, but they'll all have some overhead.

> The reason I'm asking is that I have tons of tables like this,
> and some data sets are much bigger than this. In a relatively
> simple testcase I'm importing data from text files which are
> 5.7 GB in total, and this causes the db size to grow to 34 GB.

Anything from double to ten times the size isn't unexpected, depending on
row sizes and how many indexes you are talking about.

--
Richard Huxton
Archonet Ltd
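[To make the 24-byte figure concrete, here is a rough sketch against the
single-integer "testing" table from earlier in the thread. pg_column_size()
of a whole-row reference measures the row as a composite value, whose header
is roughly the same size as the on-disk tuple header, so the difference from
the bare column gives a feel for the fixed per-row cost (exact numbers vary
by platform and version).

-- Payload vs. row value including its header; neither figure counts the
-- 4-byte line pointer or the page header that are also paid on disk.
SELECT pg_column_size(t.x)  AS payload_bytes,   -- 4 for an integer
       pg_column_size(t.*)  AS row_value_bytes  -- roughly header + payload
FROM testing t
LIMIT 1;]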
Richard Huxton <dev@archonet.com> writes:
> On 16/04/10 10:41, Peter Bex wrote:
>> Is there a way to reduce the per-tuple storage overhead?

> Short answer - no.

About the only thing you could really do is rethink the table layout.
If you can put more data per row, then the fractional overhead for
tuple headers naturally will be less. Whether this results in a
convenient-to-use table is hard to guess without knowing your
application ...

			regards, tom lane
On Fri, Apr 16, 2010 at 5:41 AM, Peter Bex <Peter.Bex@xs4all.nl> wrote:
> Hi all,
>
> I have a table with three columns: one integer and two doubles.
> There are two indexes defined (one on the integer and one on one
> of the doubles). This table stores 700000 records, which take up
> 30 MB according to pg_relation_size(), and the total relation size
> is 66 MB.
[snip]
> Is there a way to reduce the per-tuple storage overhead?
>
> The reason I'm asking is that I have tons of tables like this,
> and some data sets are much bigger than this. In a relatively
> simple testcase I'm importing data from text files which are
> 5.7 GB in total, and this causes the db size to grow to 34 GB.
>
> This size is just one small sample of many such datasets that I
> need to import, so disk size is really an important factor.

If you are storing big data and want to keep the overhead low, the
first thing you need to examine is organizing your data into arrays.
This involves tradeoffs of course and may not work but it's worth a
shot!

merlin
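[A sketch of what that could look like for the data described elsewhere in
this thread: one row per (run, timestep), with all output values for that
timestep packed into a single array, so the per-row header is paid once per
timestep instead of once per value. Every name here is made up for
illustration.

-- Hypothetical array-based layout (names are not from the thread).
CREATE TABLE run_output (
    run_id   integer            NOT NULL,
    timestep double precision   NOT NULL,
    outputs  double precision[] NOT NULL,  -- one element per output variable
    PRIMARY KEY (run_id, timestep)
);

-- One timestep of one run; the values are placeholders.
INSERT INTO run_output (run_id, timestep, outputs)
VALUES (1, 0.001, ARRAY[1.5, 2.25, 3.125]);

The trade-off is that individual values can no longer be indexed directly,
and arrays wide enough to be TOASTed may have to be fetched and decompressed
as a whole just to read a single element.]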
On Fri, Apr 16, 2010 at 12:40:21PM +0200, Szymon Guz wrote:
> I thought the default fillfactor was much smaller (and hadn't checked it
> until now)... sorry for messing that up.
>
> But let's look at it from the other side: what do you want to do with that
> data? Maybe PostgreSQL, with its MVCC overhead, isn't the best solution for
> your needs.

I'm using this as part of a larger application; the data sets are one
aspect of it. The idea is the following:

An engineering application generates time-dependent data. One simulation
yields a very big text file in tabular format, with hundreds or thousands
of output-value columns (often more than Postgres' column limit), one row
per timestamp. One such file is generated for each permutation of the input
values which influence the run of a simulation.

Each text file is imported into the database so we can perform very fast
lookups on the numbers and plot them quickly in a graph. The user can select
any number of input permutations and graph any selected output values to see
the effect of the variation in input.

One can plot any variable against any other, so one join is made for each
variable that we want to plot; it joins the timestep values of the variable
on the X axis to those on the Y axis.

Regards,
Peter
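[To make that join pattern concrete, a sketch under assumed names (no actual
schema appears in the thread): with one output value per row, plotting one
variable against another takes one extra join per plotted variable, matched
on run and timestep.

-- Hypothetical narrow layout: one output value per row.
CREATE TABLE datapoint (
    run_id      integer          NOT NULL,
    timestep    double precision NOT NULL,
    variable_id integer          NOT NULL,
    value       double precision NOT NULL
);
CREATE INDEX datapoint_lookup ON datapoint (run_id, variable_id, timestep);

-- Plot variable 17 (X axis) against variable 42 (Y axis) for run 3:
SELECT x.value AS x, y.value AS y
FROM datapoint x
JOIN datapoint y ON  y.run_id   = x.run_id
                 AND y.timestep = x.timestep
WHERE x.run_id      = 3
  AND x.variable_id = 17
  AND y.variable_id = 42
ORDER BY x.timestep;

This is also the kind of layout where the roughly 28 bytes of per-row
bookkeeping (tuple header plus line pointer) are on the same order as the
payload itself, which is what the array layout sketched above tries to
avoid.]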
On Fri, Apr 16, 2010 at 11:28:36AM -0400, Merlin Moncure wrote:
> If you are storing big data and want to keep the overhead low, the
> first thing you need to examine is organizing your data into arrays.
> This involves tradeoffs of course and may not work but it's worth a
> shot!

That does sound interesting. However, I'm storing several hundred to
several thousand data points (depending on the data set). How big is
PostgreSQL's overhead when it comes to extracting one or two items from
an array in a query?

Cheers,
Peter
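[Subscripting itself is a cheap expression; the thing to watch is detoasting
once the arrays get large. A short sketch against the hypothetical
run_output table above, with placeholder array positions:

-- Pull two output variables for one run.
SELECT timestep,
       outputs[17] AS var_17,
       outputs[42] AS var_42
FROM run_output
WHERE run_id = 3
ORDER BY timestep;

If an array is large enough to be TOASTed (roughly over 2 kB before
compression), reading one element may still require fetching and
decompressing the whole array value.]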