Thread: Database storage bloat

Database storage bloat

From
reina_ga@hotmail.com (Tony Reina)
Date:
I'm developing a database for scientific recordings. These recordings
are traditionally saved as binary flat files for simplicity and
compact storage. Although I think ultimately having a database is
better than 1,000s of flat files in terms of data access, I've found
that the database (or at least my design) is pretty wasteful on
storage space compared with the binary flat files.

In particular, I tried importing all of the data from a binary flat
file that is 1.35 MB into a PostgreSQL database (a very small test
file; average production file is probably more like 100 MB). The
database directory ballooned from 4.1 MB to 92 MB (a bloat of 65X the
original storage of the binary flat file).

Now I know that table design and normalizing is important. As far as
my partner and I can tell, we've made good use of normalizing (no
redundancy), we've set field sizes to their theoretical skinniness,
and we've made use of foreign keys and views. I'm also aware that
indices/keys and other database internals will necessarily make the
DBMS solution bloated in terms of storage space. However, a 65X bloat
in space seems excessive.

Has anyone run across similar storage concerns? I'd be interested in
knowing if I just have really poorly designed tables, or if something
else is going on here. I figure a bloat of 3-4X would be permissible
(and possibly expected). But this bloat just seems too much.

Thanks.
-Tony

Re: Database storage bloat

From
"Uwe C. Schroeder"
Date:
Did you run vacuum full after your import?

On Thursday 08 April 2004 02:15 am, Tony Reina wrote:
> I'm developing a database for scientific recordings. These recordings
> are traditionally saved as binary flat files for simplicity and
> compact storage. Although I think ultimately having a database is
> better than 1,000s of flat files in terms of data access, I've found
> that the database (or at least my design) is pretty wasteful on
> storage space compared with the binary flat files.
>
> In particular, I tried importing all of the data from a binary flat
> file that is 1.35 MB into a PostgreSQL database (a very small test
> file; average production file is probably more like 100 MB). The
> database directory ballooned from 4.1 MB to 92 MB (a bloat of 65X the
> original storage of the binary flat file).
>
> Now I know that table design and normalizing is important. As far as
> my partner and I can tell, we've made good use of normalizing (no
> redundancy), we've set field sizes to their theoretical skinniness,
> and we've made use of foreign keys and views. I'm also aware that
> indices/keys and other database internals will necessarily make the
> DBMS solution bloated in terms of storage space. However, a 65X bloat
> in space seems excessive.
>
> Has anyone run across similar storage concerns? I'd be interested in
> knowing if I just have really poorly designed tables, or if something
> else is going on here. I figure a bloat of 3-4X would be permissible
> (and possibly expected). But this bloat just seems too much.
>
> Thanks.
> -Tony

--
    UC


Re: Database storage bloat

From
Douglas Trainor
Date:
Saying "we've set field sizes to their theoretical skinniness" makes me think
that you may have the wrong data types. For example, you may have used CHAR
and not VARCHAR.
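
A minimal sketch of the difference (the table and column names here are
made up): char(n) is blank-padded to n characters on every row, while
varchar/text store only what you actually put in.

CREATE TABLE pad_demo (
    fixed    char(40),
    flexible text
);
INSERT INTO pad_demo VALUES ('abc', 'abc');
SELECT octet_length(fixed)    AS fixed_bytes,     -- 40: padded out to the declared width
       octet_length(flexible) AS flexible_bytes   -- 3: just the data
FROM pad_demo;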

    douglas

Tony Reina wrote:

>I'm developing a database for scientific recordings. These recordings
>are traditionally saved as binary flat files for simplicity and
>compact storage. Although I think ultimately having a database is
>better than 1,000s of flat files in terms of data access, I've found
>that the database (or at least my design) is pretty wasteful on
>storage space compared with the binary flat files.
>
>In particular, I tried importing all of the data from a binary flat
>file that is 1.35 MB into a PostgreSQL database (a very small test
>file; average production file is probably more like 100 MB). The
>database directory ballooned from 4.1 MB to 92 MB (a bloat of 65X the
>original storage of the binary flat file).
>
>Now I know that table design and normalizing is important. As far as
>my partner and I can tell, we've made good use of normalizing (no
>redundancy), we've set field sizes to their theoretical skinniness,
>[...]
>
>


Re: Database storage bloat

From
"Tony and Bryn Reina"
Date:
Yep. That's after a 'vacuum verbose analyze'.

-Tony

----- Original Message -----
From: "Uwe C. Schroeder" <uwe@oss4u.com>
To: "Tony Reina" <reina_ga@hotmail.com>; <pgsql-admin@postgresql.org>
Sent: Thursday, April 08, 2004 11:57 AM
Subject: Re: [ADMIN] Database storage bloat


Did you run vacuum full after your import?

On Thursday 08 April 2004 02:15 am, Tony Reina wrote:
> I'm developing a database for scientific recordings. These recordings
> are traditionally saved as binary flat files for simplicity and
> compact storage. Although I think ultimately having a database is
> better than 1,000s of flat files in terms of data access, I've found
> that the database (or at least my design) is pretty wasteful on
> storage space compared with the binary flat files.
>
> In particular, I tried importing all of the data from a binary flat
> file that is 1.35 MB into a PostgreSQL database (a very small test
> file; average production file is probably more like 100 MB). The
> database directory ballooned from 4.1 MB to 92 MB (a bloat of 65X the
> original storage of the binary flat file).
>
> Now I know that table design and normalizing is important. As far as
> my partner and I can tell, we've made good use of normalizing (no
> redundancy), we've set field sizes to their theoretical skinniness,
> and we've made use of foreign keys and views. I'm also aware that
> indices/keys and other database internals will necessarily make the
> DBMS solution bloated in terms of storage space. However, a 65X bloat
> in space seems excessive.
>
> Has anyone run across similar storage concerns? I'd be interested in
> knowing if I just have really poorly designed tables, or if something
> else is going on here. I figure a bloat of 3-4X would be permissible
> (and possibly expected). But this bloat just seems too much.
>
> Thanks.
> -Tony

--
UC


Re: Database storage bloat

From
"Tony and Bryn Reina"
Date:
I'm using 'text' instead of char. That seemed to cut the bloat down by about
30%. There's only about 11 tables in the DB. I included them at the bottom
in case you're interested.

Perhaps there's just something fundamentally boneheaded about my design
(re-reading "Database Design for Mere Mortals" as I write).

-Tony

CREATE SEQUENCE FileIndicies_fileID_seq;
CREATE TABLE FileIndicies (
 fileID      integer DEFAULT nextval('FileIndicies_fileID_seq') UNIQUE NOT NULL,  --PK
 szFileName              text NOT NULL,
 szFileType  text,
 CreationDate  date,
 CreationTime  time,
 dtimestampresolution float4,
 dtimespan  float4,
 szAppName  text,
 szFileComment  text,
 PRIMARY KEY (fileID),
 UNIQUE (szFileName, szFileType, CreationDate, CreationTime)
);

CREATE SEQUENCE EntityFile_dbIndex_seq;
CREATE TABLE EntityFile (
 EntityID  integer, --PK
 fileID                  integer REFERENCES FileIndicies (fileID),
 dbIndex   integer DEFAULT nextval('EntityFile_dbIndex_seq') UNIQUE NOT NULL,
 PRIMARY KEY (fileID, EntityID)
);

-- AnalogIndex
CREATE TABLE AnalogIndicies (
 dbIndex   integer REFERENCES EntityFile (dbIndex),
 dsamplerate  float4 CHECK (dsamplerate > (0)::float4),
 dminval   float4,
 dmaxval   float4 CHECK (dmaxval >= dminval),
 szunits   text,
 dresolution  float4 CHECK (dresolution > (0)::float4),
 dlocationx  float4,
 dlocationy  float4,
 dlocationz  float4,
 dlocationuser  float4,
 dhighfreqcorner  float4 CHECK (dhighfreqcorner >= (0)::float4),
 dwhighfreqorder  float4 CHECK (dwhighfreqorder >= (0)::float4),
 szhighfiltertype text,
 dlowfreqcorner  float4 CHECK (dlowfreqcorner >= (0)::float4),
 dwlowfreqorder  float4 CHECK (dwlowfreqorder >= (0)::float4),
 szlowfiltertype  float4,
 szprobeinfo  text,
 PRIMARY KEY (dbIndex)
);

CREATE TABLE AnalogValues (
 dbIndex                 integer REFERENCES EntityFile (dbIndex),
 dtimestamp  float4 NOT NULL,
 dvalue   float4,
 PRIMARY KEY (dbIndex, dtimestamp)
);

CREATE TABLE EventIndicies (
 dbIndex                 integer REFERENCES EntityFile (dbIndex),
 dweventtype  smallint CHECK (dweventtype >=0 AND dweventtype <= 4),
 dwmindatalength  smallint CHECK (dwmindatalength > (0)::float4),
 dwmaxdatalength  smallint CHECK (dwmaxdatalength >= dwmindatalength),
 szcsvdesc  text,
 PRIMARY KEY (dbIndex)
);

CREATE TABLE EventValues (
 dbIndex                 integer REFERENCES EntityFile (dbIndex),
 dtimestamp  float4,
 dwdatabytesize  smallint CHECK (dwdatabytesize > (0)::float4),
 eventvalue  text,
 PRIMARY KEY (dbIndex, dtimestamp)
);

CREATE TABLE NeuralIndicies (
 dbIndex                 integer REFERENCES EntityFile (dbIndex),
 dwsourceentityid smallint,
 dwsourceunitid  smallint,
 szprobeinfo  text,
 PRIMARY KEY (dbIndex)
);

CREATE TABLE NeuralValues (
 dbIndex                 integer REFERENCES EntityFile (dbIndex),
 dtimestamp  float4 NOT NULL,
 PRIMARY KEY (dbIndex, dtimestamp)
);

CREATE TABLE SegmentIndicies (
 dbIndex                 integer REFERENCES EntityFile (dbIndex),
 dwsourcecount  smallint CHECK (dwsourcecount > (0)::float4),
 dwminsamplecount smallint CHECK (dwminsamplecount > (0)::float4),
 dwmaxsamplecount smallint CHECK (dwmaxsamplecount > (0)::float4),
 dsamplerate  float4 CHECK (dsamplerate > (0)::float4),
 szunits   text,
 PRIMARY KEY (dbIndex)
);

CREATE TABLE SegmentSourceIndicies (
 dbIndex                 integer REFERENCES EntityFile (dbIndex),
 dwsourceid              integer CHECK (dwsourceid >= 0),
 dminval   float4,
 dmaxval   float4 CHECK (dmaxval >= dminval),
 dresolution  float4 CHECK (dresolution > (0)::float4),
 dsubsampleshift  float4,
 dlocationx  float4,
 dlocationy  float4,
 dlocationz  float4,
 dlocationuser  float4,
 dhighfreqcorner  float4 CHECK (dhighfreqcorner >= (0)::float4),
 dwhighfreqorder  float4 CHECK (dwhighfreqorder >= (0)::float4),
 szhighfiltertype text,
 dlowfreqcorner  float4 CHECK (dlowfreqcorner >= (0)::float4),
 dwlowfreqorder  float4 CHECK (dwlowfreqorder >= (0)::float4),
 szlowfiltertype  text,
 szprobeinfo  text,
 PRIMARY KEY (dbIndex, dwsourceid)
);

CREATE TABLE SegmentValues (
 dbIndex                 integer REFERENCES EntityFile (dbIndex),
 dwunitid  smallint,
 dwsampleindex  smallint,
 dtimestamp  float4,
 dvalue   float4,
 PRIMARY KEY (dbIndex, dtimestamp, dwsampleindex, dwunitid)
);

CREATE VIEW SegmentData AS
 SELECT a.szFileName, a.szFileType, a.creationdate, a.creationtime,
 c.EntityID,
 b.dwunitid, b.dwsampleindex, b.dtimestamp, b.dvalue
 FROM FileIndicies AS a, SegmentValues AS b, EntityFile AS c
 WHERE
 b.dbIndex = c.dbIndex AND
 a.fileID = c.fileID
 ORDER BY a.szFileName, a.szFileType, a.creationdate, a.creationtime,
c.EntityID,
 b.dwunitid, b.dwsampleindex, b.dtimestamp;




----- Original Message -----
From: "Douglas Trainor" <trainor@uic.edu>
To: "Tony Reina" <reina_ga@hotmail.com>
Cc: <pgsql-admin@postgresql.org>
Sent: Thursday, April 08, 2004 1:41 PM
Subject: Re: [ADMIN] Database storage bloat


> Saying "we've set field sizes to their theoretical skinniness" makes me think
> that you may have the wrong data types. For example, you may have used CHAR
> and not VARCHAR.
>
>     douglas
>
> Tony Reina wrote:
>
> >I'm developing a database for scientific recordings. These recordings
> >are traditionally saved as binary flat files for simplicity and
> >compact storage. Although I think ultimately having a database is
> >better than 1,000s of flat files in terms of data access, I've found
> >that the database (or at least my design) is pretty wasteful on
> >storage space compared with the binary flat files.
> >
> >In particular, I tried importing all of the data from a binary flat
> >file that is 1.35 MB into a PostgreSQL database (a very small test
> >file; average production file is probably more like 100 MB). The
> >database directory ballooned from 4.1 MB to 92 MB (a bloat of 65X the
> >original storage of the binary flat file).
> >
> >Now I know that table design and normalizing is important. As far as
> >my partner and I can tell, we've made good use of normalizing (no
> >redundancy), we've set field sizes to their theoretical skinniness,
> >[...]
> >
> >
>
>

Re: Database storage bloat

From
Nicolas Kowalski
Date:
On Thu, 8 Apr 2004, Tony Reina wrote:

> Has anyone run across similar storage concerns? I'd be interested in
> knowing if I just have really poorly designed tables, or if something
> else is going on here. I figure a bloat of 3-4X would be permissible
> (and possibly expected). But this bloat just seems too much.

Yes; however, my example is not very accurate because we are still
running an old version (7.2.1, Debian stable).

In our case, the dump file created by pg_dumpall is around 4 MB, but the
data directory is around 180 MB.

Regards.

--
Nicolas

Re: Database storage bloat

From
Stephan Szabo
Date:
On Thu, 8 Apr 2004, Tony Reina wrote:

> I'm developing a database for scientific recordings. These recordings
> are traditionally saved as binary flat files for simplicity and
> compact storage. Although I think ultimately having a database is
> better than 1,000s of flat files in terms of data access, I've found
> that the database (or at least my design) is pretty wasteful on
> storage space compared with the binary flat files.
>
> In particular, I tried importing all of the data from a binary flat
> file that is 1.35 MB into a PostgreSQL database (a very small test
> file; average production file is probably more like 100 MB). The
> database directory ballooned from 4.1 MB to 92 MB (a bloat of 65X the
> original storage of the binary flat file).

Well, an important question is where that space is going. It'd be
interesting to get a breakdown by directory and then by file (using
contrib/oid2name to see which tables/indexes/etc. they correspond to).

At least 16MB of that is probably going into the transaction log (IIRC
that's the default size for the segments) in pg_xlog.
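
One rough way to get that breakdown from inside the database (a sketch; it
assumes the default 8 kB block size, and relpages is only as fresh as the
last VACUUM or ANALYZE):

SELECT relname, relkind, relpages,
       relpages * 8 AS approx_kb    -- 8 kB blocks assumed
FROM pg_class
ORDER BY relpages DESC
LIMIT 15;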

Re: Database storage bloat

From
Tom Lane
Date:
reina_ga@hotmail.com (Tony Reina) writes:
> However, a 65X bloat in space seems excessive.

Without concrete details about your schema, it's hard to answer that.
I suspect you made some inefficient choices, but have no data.

For starters you should do a database-wide VACUUM to ensure
pg_class.relpages is up to date, then look to see which tables and
indexes contain the most pages.

            regards, tom lane

Re: Database storage bloat

From
"Tony and Bryn Reina"
Date:
> Well, an important question is where is that space going? It'd be
> interesting to give a breakup by the directories and then which files (and
> using the contrib/oid2name to get which table/indexes/etc they are).
>
> At least 16MB of that is probably going into the transaction log (IIRC
> that's the default size for the segments) in pg_xlog.
>

Thanks Stephan. That at least helps me narrow it down to my one problem
table: segmentvalues. It is taking up almost 50 MB, and its primary key
index is over 38 MB. Nothing else even comes close. Here's the relevant
output from oid2name, in order of size.

SIZE (KB)    OID      TABLENAME
48844        17296    segmentvalues
38100        17298    segmentvalues_pkey
1108         16642    pg_proc_proname_args_nsp_index
852          17246    neuralvalues
676          17248    neuralvalues_pkey
...

Tom asked to see the table schema. Here are the 3 relevant tables for the
bloat:

CREATE SEQUENCE FileIndicies_fileID_seq;
CREATE TABLE FileIndicies (
     fileID      integer DEFAULT nextval('FileIndicies_fileID_seq') UNIQUE NOT NULL,  --PK
     szFileName              text NOT NULL,
     szFileType  text,
     CreationDate  date,
     CreationTime  time,
     dtimestampresolution float4,
     dtimespan  float4,
     szAppName  text,
     szFileComment  text,
     PRIMARY KEY (fileID),
     UNIQUE (szFileName, szFileType, CreationDate, CreationTime)
);

CREATE SEQUENCE EntityFile_dbIndex_seq;
CREATE TABLE EntityFile (
     EntityID  integer, --PK
     fileID                  integer REFERENCES FileIndicies (fileID),
     dbIndex   integer DEFAULT nextval('EntityFile_dbIndex_seq') UNIQUE NOT NULL,
     PRIMARY KEY (fileID, EntityID)
);


CREATE TABLE SegmentValues (
     dbIndex                 integer REFERENCES EntityFile (dbIndex),
     dwunitid  smallint,
     dwsampleindex  smallint,
     dtimestamp  float4,
     dvalue   float4,
     PRIMARY KEY (dbIndex, dtimestamp, dwsampleindex, dwunitid)
);

I suppose one thing the binary flat file may be doing is not including the
timestamp in table SegmentValues. Since I know the sampling rate, I can
just calculate the timestamp on the fly from the rate and the sample index
(assuming no time offset). That would drop a float4 field but add a
smallint field back to the table.
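
A sketch of that calculation as a query (hypothetical: it assumes dtimestamp
is dropped, dsamplerate is in samples per second, and sample numbering starts
at zero with no offset):

SELECT v.dbIndex,
       v.dwsampleindex,
       v.dwsampleindex / s.dsamplerate AS dtimestamp,   -- recomputed on the fly
       v.dvalue
FROM SegmentValues AS v
JOIN SegmentIndicies AS s USING (dbIndex);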

Is there any further normalization that I'm missing?

-Tony







Re: Database storage bloat

From
Tom Lane
Date:
"Tony and Bryn Reina" <reina_ga@hotmail.com> writes:
> There's only about 11 tables in the DB. I included them at the bottom
> in case you're interested.

What are the actual sizes of the tables?  Probably the most useful way
you could present that info is "vacuum verbose" output for each table.

            regards, tom lane

Re: Database storage bloat

From
Steve Crawford
Date:
On Thursday 08 April 2004 5:51 am, Tony and Bryn Reina wrote:
> Yep. That's after a 'vacuum verbose analyze'.

No, he asked if you had run a "vacuum full". A "standard" vacuum just
marks space available for reuse - it does not shrink file sizes. A
"vacuum full" will shrink the files on disk.

Are you doing many updates on the table that is the space-using
culprit? Each record that is updated is created anew. The old record
is marked as "defunct" when no remaining transactions need that
record. If you were to, say, update the entire table to change a
single character field you would basically double the size of the
table.

A vacuum will mark the space that is no longer used as available for
reuse so if you run regular vacuums you should achieve a fairly
static size.
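
As commands, the distinction between the two looks like this (the table name
is just an example):

VACUUM ANALYZE segmentvalues;  -- marks dead row versions reusable; files keep their size
VACUUM FULL segmentvalues;     -- compacts the table and shrinks the files on disk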

One thing to consider: since this is how PostgreSQL achieves MVCC you
may want to try to avoid updates to a table that has huge record
sizes.

For example, say you wanted to store a "last viewed date" for each
file. If you had that date in the same table with your data every
view would add considerable "bloat" to your table. If, instead, your
file is in a table by itself along with a sequence or other unique
key then the "last viewed date" or other changing data could be put
into a separate table linked to the file storage table by that unique
ID. Updating the date or other info about the file would only
increase the bloat in the other table and the bloat would be small.

Performance would probably improve as well due to smaller file sizes
and less read/write action on the disks.
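
A sketch of that split (the table and column names are made up, not from
Tony's schema):

CREATE TABLE file_data (
    file_id  serial PRIMARY KEY,
    payload  bytea            -- the large, rarely-updated recording data
);

CREATE TABLE file_activity (
    file_id      integer PRIMARY KEY REFERENCES file_data (file_id),
    last_viewed  date
);

-- Only the small row is rewritten on each view, so the dead-tuple bloat
-- accumulates in the small table instead of the big one:
UPDATE file_activity SET last_viewed = CURRENT_DATE WHERE file_id = 42;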

Cheers,
Steve


Re: Database storage bloat

From
Tom Lane
Date:
"Tony and Bryn Reina" <reina_ga@hotmail.com> writes:
> CREATE TABLE SegmentValues (
>      dbIndex                 integer REFERENCES EntityFile (dbIndex),
>      dwunitid  smallint,
>      dwsampleindex  smallint,
>      dtimestamp  float4,
>      dvalue   float4,
>      PRIMARY KEY (dbIndex, dtimestamp, dwsampleindex, dwunitid)
> );

> I suppose one thing the binary flat file may be doing is not including the
> time stamp in table SegmentValues. Since I know the sampling rate, I can
> just calculate the timestamp on the fly by the rate times the index
> (assuming no time offset). That would lose a float4 field, but would add
> back a smallint field to the table.

That won't buy you anything at all --- the two bytes saved would be lost
again to alignment padding.  (I'm assuming you're on PC hardware with
MAXALIGN = 4 bytes.)

I don't see orders-of-magnitude bloat here though.  You've got 16 bytes
of useful data per row (which I suppose was 12 bytes in the flat file?).
There will be 28 bytes of overhead per table row.  In addition the index
will require 12 data bytes + 12 overhead bytes per entry; allowing for
the fact that b-tree only likes to pack pages about 2/3rds full, we could
estimate index size as about 36 bytes per original row, giving an
aggregate bloat factor of 6.67X compared to a binary flat file if the
flat file needed 12 bytes per row.
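
(Spelling that estimate out per original row, with the same numbers:)

--  heap:  16 data bytes + 28 bytes row overhead          = 44 bytes
--  index: (12 data + 12 overhead bytes) * 3/2, ~2/3 full = 36 bytes
--  total: 80 bytes, vs. 12 bytes per row in the flat file
SELECT (16 + 28 + (12 + 12) * 1.5) / 12.0 AS estimated_bloat;   -- ~6.67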

The only way I could see to get to a 65X bloat factor would be if you'd
repeatedly updated the table rows without vacuuming.

            regards, tom lane

Re: Database storage bloat

From
"Tony and Bryn Reina"
Date:
> I don't see orders-of-magnitude bloat here though.  You've got 16 bytes
> of useful data per row (which I suppose was 12 bytes in the flat file?).
> There will be 28 bytes of overhead per table row.  In addition the index
> will require 12 data bytes + 12 overhead bytes per entry; allowing for
> the fact that b-tree only likes to pack pages about 2/3rds full, we could
> estimate index size as about 36 bytes per original row, giving an
> aggregate bloat factor of 6.67X compared to a binary flat file if the
> flat file needed 12 bytes per row.
>
> The only way I could see to get to a 65X bloat factor would be if you'd
> repeatedly updated the table rows without vacuuming.

Thanks Tom (and everyone else). At least I feel more comfortable that
there's nothing obviously wrong. I did perform a 'vacuum full', but still
no dice. Thanks to Stephan, I've narrowed my search down to one table, and
I have a ballpark theoretical bloat factor to shoot for (6.67X). I'm going
to go back and try to determine whether I'm correctly interpreting the part
of the flat file that holds this table's data.

-Tony

Re: Database storage bloat -- FOLLOWUP

From
reina_ga@hotmail.com (Tony Reina)
Date:
I thought I'd post a followup to the question of storage bloat. I
tried approaching the problem by playing it a little loose with database
normalization and using arrays instead of flattening everything out in
the tables.

So my original table,

CREATE TABLE original (
dbIndex integer,
index1 smallint,
index2 smallint,
index3 smallint,
dvalue real
);

becomes

CREATE TABLE newtable (
dbIndex integer,
dvalue   real[][][]
);
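
For what it's worth, here's roughly how the array layout gets populated and
read back (the literal values and subscripts are just made-up examples):

-- one row per entity; the 3-D array holds sample values by (time, X, Y)
INSERT INTO newtable (dbIndex, dvalue)
VALUES (1, '{{{0.1,0.2},{0.3,0.4}},{{0.5,0.6},{0.7,0.8}}}'::real[]);

-- pull a single sample back out by its (time, X, Y) subscripts
SELECT dvalue[2][1][2] FROM newtable WHERE dbIndex = 1;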

Now I'd expect to get better storage here just because I'm dropping 3
smallint fields (2 bytes x 3 for each value), so maybe something like a
6-fold decrease.

However, in actuality I'm seeing a 30-fold decrease:

original table = 48 MB
newtable       = 1.6 MB !!!!!

If I'm understanding the data directory correctly, the array tuples
are being toasted (which I think must compress them). The actual table
in the newtable format is only 8K and the pg_toast that goes with this
table is 1.6MB.

So I'm very pleased with the storage here. Plus, reads are faster than
with the original design. All in all, I think I've found a good solution
for my kind of data.

Now I know that the use of arrays is considered taboo in the books,
but in my case the data being stored actually do fall into an array
(matrix) structure naturally (it's a time-based recording, so there's a
sample for each time point and each X,Y position -- a natural 3D
matrix).

What are the possible downsides that I may be missing?