Thread: Database storage bloat
I'm developing a database for scientific recordings. These recordings are traditionally saved as binary flat files for simplicity and compact storage. Although I think ultimately having a database is better than 1,000s of flat files in terms of data access, I've found that the database (or at least my design) is pretty wasteful on storage space compared with the binary flat files.

In particular, I tried importing all of the data from a binary flat file that is 1.35 MB into a PostgreSQL database (a very small test file; the average production file is probably more like 100 MB). The database directory ballooned from 4.1 MB to 92 MB (a bloat of 65X the original storage of the binary flat file).

Now I know that table design and normalization are important. As far as my partner and I can tell, we've made good use of normalizing (no redundancy), we've set field sizes to their theoretical skinniness, and we've made use of foreign keys and views. I'm also aware that indices/keys and other database internals will necessarily make the DBMS solution bloated in terms of storage space. However, a 65X bloat in space seems excessive.

Has anyone run across similar storage concerns? I'd be interested in knowing if I just have really poorly designed tables, or if something else is going on here. I figure a bloat of 3-4X would be permissible (and possibly expected). But this bloat just seems too much.

Thanks.
-Tony
Did you run vacuum full after your import?

On Thursday 08 April 2004 02:15 am, Tony Reina wrote:
> I'm developing a database for scientific recordings. [...] However, a
> 65X bloat in space seems excessive.

--
UC
Open Source Solutions 4U, LLC
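For reference, a minimal sketch of the command being suggested here (run from psql; "mytable" is only a placeholder name):

VACUUM FULL VERBOSE;            -- rewrites every table, returning dead space to the OS
VACUUM FULL VERBOSE mytable;    -- or limit it to a single suspect table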
Saying "we've set field sizes to their theoretical skinniness" makes me think that you may have the wrong data types. For example, you may have used CHAR and not VARCHAR. douglas Tony Reina wrote: >I'm developing a database for scientific recordings. These recordings >are traditionally saved as binary flat files for simplicity and >compact storage. Although I think ultimately having a database is >better than 1,000s of flat files in terms of data access, I've found >that the database (or at least my design) is pretty wasteful on >storage space compared with the binary flat files. > >In particular, I tried importing all of the data from a binary flat >file that is 1.35 MB into a PostgreSQL database (a very small test >file; average production file is probably more like 100 MB). The >database directory ballooned from 4.1 MB to 92 MB (a bloat of 65X the >original storage of the binary flat file). > >Now I know that table design and normalizing is important. As far as >my partner and I can tell, we've made good use of normalizing (no >redundancy), we've set field sizes to their theoretical skinniness, >[...] > >
Yep. That's after a 'vacuum verbose analyze'.

-Tony

----- Original Message -----
From: "Uwe C. Schroeder" <uwe@oss4u.com>
Sent: Thursday, April 08, 2004 11:57 AM
Subject: Re: [ADMIN] Database storage bloat

> Did you run vacuum full after your import?
I'm using 'text' instead of char. That seemed to cut the bloat down by about 30%. There's only about 11 tables in the DB. I included them at the bottom in case you're interested. Perhaps there's just something fundamentally boneheaded about my design (re-reading "Database Design for Mere Mortals" as I write).

-Tony

CREATE SEQUENCE FileIndicies_fileID_seq;

CREATE TABLE FileIndicies (
    fileID integer DEFAULT nextval('FileIndicies_fileID_seq') UNIQUE NOT NULL, --PK
    szFileName text NOT NULL,
    szFileType text,
    CreationDate date,
    CreationTime time,
    dtimestampresolution float4,
    dtimespan float4,
    szAppName text,
    szFileComment text,
    PRIMARY KEY (fileID),
    UNIQUE (szFileName, szFileType, CreationDate, CreationTime)
);

CREATE SEQUENCE EntityFile_dbIndex_seq;

CREATE TABLE EntityFile (
    EntityID integer, --PK
    fileID integer REFERENCES FileIndicies (fileID),
    dbIndex integer DEFAULT nextval('EntityFile_dbIndex_seq') UNIQUE NOT NULL,
    PRIMARY KEY (fileID, EntityID)
);

-- AnalogIndex
CREATE TABLE AnalogIndicies (
    dbIndex integer REFERENCES EntityFile (dbIndex),
    dsamplerate float4 CHECK (dsamplerate > (0)::float4),
    dminval float4,
    dmaxval float4 CHECK (dmaxval >= dminval),
    szunits text,
    dresolution float4 CHECK (dresolution > (0)::float4),
    dlocationx float4,
    dlocationy float4,
    dlocationz float4,
    dlocationuser float4,
    dhighfreqcorner float4 CHECK (dhighfreqcorner >= (0)::float4),
    dwhighfreqorder float4 CHECK (dwhighfreqorder >= (0)::float4),
    szhighfiltertype text,
    dlowfreqcorner float4 CHECK (dlowfreqcorner >= (0)::float4),
    dwlowfreqorder float4 CHECK (dwlowfreqorder >= (0)::float4),
    szlowfiltertype float4,
    szprobeinfo text,
    PRIMARY KEY (dbIndex)
);

CREATE TABLE AnalogValues (
    dbIndex integer REFERENCES EntityFile (dbIndex),
    dtimestamp float4 NOT NULL,
    dvalue float4,
    PRIMARY KEY (dbIndex, dtimestamp)
);

CREATE TABLE EventIndicies (
    dbIndex integer REFERENCES EntityFile (dbIndex),
    dweventtype smallint CHECK (dweventtype >= 0 AND dweventtype <= 4),
    dwmindatalength smallint CHECK (dwmindatalength > (0)::float4),
    dwmaxdatalength smallint CHECK (dwmaxdatalength >= dwmindatalength),
    szcsvdesc text,
    PRIMARY KEY (dbIndex)
);

CREATE TABLE EventValues (
    dbIndex integer REFERENCES EntityFile (dbIndex),
    dtimestamp float4,
    dwdatabytesize smallint CHECK (dwdatabytesize > (0)::float4),
    eventvalue text,
    PRIMARY KEY (dbIndex, dtimestamp)
);

CREATE TABLE NeuralIndicies (
    dbIndex integer REFERENCES EntityFile (dbIndex),
    dwsourceentityid smallint,
    dwsourceunitid smallint,
    szprobeinfo text,
    PRIMARY KEY (dbIndex)
);

CREATE TABLE NeuralValues (
    dbIndex integer REFERENCES EntityFile (dbIndex),
    dtimestamp float4 NOT NULL,
    PRIMARY KEY (dbIndex, dtimestamp)
);

CREATE TABLE SegmentIndicies (
    dbIndex integer REFERENCES EntityFile (dbIndex),
    dwsourcecount smallint CHECK (dwsourcecount > (0)::float4),
    dwminsamplecount smallint CHECK (dwminsamplecount > (0)::float4),
    dwmaxsamplecount smallint CHECK (dwmaxsamplecount > (0)::float4),
    dsamplerate float4 CHECK (dsamplerate > (0)::float4),
    szunits text,
    PRIMARY KEY (dbIndex)
);

CREATE TABLE SegmentSourceIndicies (
    dbIndex integer REFERENCES EntityFile (dbIndex),
    dwsourceid integer CHECK (dwsourceid >= 0),
    dminval float4,
    dmaxval float4 CHECK (dmaxval >= dminval),
    dresolution float4 CHECK (dresolution > (0)::float4),
    dsubsampleshift float4,
    dlocationx float4,
    dlocationy float4,
    dlocationz float4,
    dlocationuser float4,
    dhighfreqcorner float4 CHECK (dhighfreqcorner >= (0)::float4),
    dwhighfreqorder float4 CHECK (dwhighfreqorder >= (0)::float4),
    szhighfiltertype text,
    dlowfreqcorner float4 CHECK (dlowfreqcorner >= (0)::float4),
    dwlowfreqorder float4 CHECK (dwlowfreqorder >= (0)::float4),
    szlowfiltertype text,
    szprobeinfo text,
    PRIMARY KEY (dbIndex, dwsourceid)
);

CREATE TABLE SegmentValues (
    dbIndex integer REFERENCES EntityFile (dbIndex),
    dwunitid smallint,
    dwsampleindex smallint,
    dtimestamp float4,
    dvalue float4,
    PRIMARY KEY (dbIndex, dtimestamp, dwsampleindex, dwunitid)
);

CREATE VIEW SegmentData AS
    SELECT a.szFileName, a.szFileType, a.creationdate, a.creationtime,
           c.EntityID, b.dwunitid, b.dwsampleindex, b.dtimestamp, b.dvalue
      FROM FileIndicies AS a, SegmentValues AS b, EntityFile AS c
     WHERE b.dbIndex = c.dbIndex
       AND a.fileID = c.fileID
     ORDER BY a.szFileName, a.szFileType, a.creationdate, a.creationtime,
              c.EntityID, b.dwunitid, b.dwsampleindex, b.dtimestamp;

----- Original Message -----
From: "Douglas Trainor" <trainor@uic.edu>
Sent: Thursday, April 08, 2004 1:41 PM
Subject: Re: [ADMIN] Database storage bloat

> Saying "we've set field sizes to their theoretical skinniness" makes me
> think that you may have the wrong data types. For example, you may have
> used CHAR and not VARCHAR.
On Thu, 8 Apr 2004, Tony Reina wrote:
> Has anyone run across similar storage concerns? I'd be interested in
> knowing if I just have really poorly designed tables, or if something
> else is going on here. I figure a bloat of 3-4X would be permissible
> (and possibly expected). But this bloat just seems too much.

Yes, though my example is not very accurate because we are still running an old version (7.2.1, Debian stable). In our case, the dump file created with pg_dumpall is around 4 MB, but the data directory is around 180 MB.

Regards.
-- Nicolas
On Thu, 8 Apr 2004, Tony Reina wrote:
> In particular, I tried importing all of the data from a binary flat
> file that is 1.35 MB into a PostgreSQL database [...]. The database
> directory ballooned from 4.1 MB to 92 MB (a bloat of 65X the original
> storage of the binary flat file).

Well, an important question is where that space is going. It would be interesting to get a breakdown by directory and then by file (using contrib/oid2name to see which tables/indexes/etc. the files belong to).

At least 16 MB of that is probably going into the transaction log in pg_xlog (IIRC that's the default size for the WAL segments).
reina_ga@hotmail.com (Tony Reina) writes:
> However, a 65X bloat in space seems excessive.

Without concrete details about your schema, it's hard to answer that. I suspect you made some inefficient choices, but have no data. For starters you should do a database-wide VACUUM to ensure pg_class.relpages is up to date, then look to see which tables and indexes contain the most pages.

regards, tom lane
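One way to get that list from psql, as a sketch (it assumes the default 8 kB block size; relpages is only an estimate that VACUUM/ANALYZE keeps current):

SELECT relname, relkind, relpages, relpages * 8 / 1024 AS approx_mb
  FROM pg_class
 ORDER BY relpages DESC
 LIMIT 10;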
> Well, an important question is where that space is going. It would be
> interesting to get a breakdown by directory and then by file (using
> contrib/oid2name to see which tables/indexes/etc. the files belong to).

Thanks Stephan. That at least helps me narrow it down to my one problem table: segmentvalues. It is taking up almost 50 MB and its primary key index is over 38 MB. Nothing else even comes close. Here's the relevant output from oid2name in order of size:

SIZE (KB)   OID       TABLENAME
48844       17296   = segmentvalues
38100       17298   = segmentvalues_pkey
 1108       16642   = pg_proc_proname_args_nsp_index
  852       17246   = neuralvalues
  676       17248   = neuralvalues_pkey
  ...

Tom asked to see the table schema. Here are the 3 relevant tables for the bloat:

CREATE SEQUENCE FileIndicies_fileID_seq;

CREATE TABLE FileIndicies (
    fileID integer DEFAULT nextval('FileIndicies_fileID_seq') UNIQUE NOT NULL, --PK
    szFileName text NOT NULL,
    szFileType text,
    CreationDate date,
    CreationTime time,
    dtimestampresolution float4,
    dtimespan float4,
    szAppName text,
    szFileComment text,
    PRIMARY KEY (fileID),
    UNIQUE (szFileName, szFileType, CreationDate, CreationTime)
);

CREATE SEQUENCE EntityFile_dbIndex_seq;

CREATE TABLE EntityFile (
    EntityID integer, --PK
    fileID integer REFERENCES FileIndicies (fileID),
    dbIndex integer DEFAULT nextval('EntityFile_dbIndex_seq') UNIQUE NOT NULL,
    PRIMARY KEY (fileID, EntityID)
);

CREATE TABLE SegmentValues (
    dbIndex integer REFERENCES EntityFile (dbIndex),
    dwunitid smallint,
    dwsampleindex smallint,
    dtimestamp float4,
    dvalue float4,
    PRIMARY KEY (dbIndex, dtimestamp, dwsampleindex, dwunitid)
);

I suppose one thing the binary flat file may be doing is not including the time stamp in table SegmentValues. Since I know the sampling rate, I can just calculate the timestamp on the fly by dividing the sample index by the sampling rate (assuming no time offset). That would lose a float4 field, but would add back a smallint field to the table.

Is there any further normalization that I'm missing?

-Tony
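If the timestamp really can be derived that way, one option is to compute it in a view rather than store it. This is purely a hypothetical sketch against the full schema posted earlier; it assumes the dsamplerate in SegmentIndicies applies uniformly to every sample of that entity and that there is no time offset:

CREATE VIEW SegmentValuesDerived AS
    SELECT v.dbIndex, v.dwunitid, v.dwsampleindex,
           v.dwsampleindex / s.dsamplerate AS derived_timestamp,  -- sample index / rate
           v.dvalue
      FROM SegmentValues AS v, SegmentIndicies AS s
     WHERE v.dbIndex = s.dbIndex;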
"Tony and Bryn Reina" <reina_ga@hotmail.com> writes: > There's only about 11 tables in the DB. I included them at the bottom > in case you're interested. What are the actual sizes of the tables? Probably the most useful way you could present that info is "vacuum verbose" output for each table. regards, tom lane
On Thursday 08 April 2004 5:51 am, Tony and Bryn Reina wrote:
> Yep. That's after a 'vacuum verbose analyze'.

No, he asked if you had run a "vacuum full". A "standard" vacuum just marks space available for reuse - it does not shrink file sizes. A "vacuum full" will shrink the files on disk.

Are you doing many updates on the table that is the space-using culprit? Each record that is updated is created anew. The old record is marked as "defunct" when no remaining transactions need that record. If you were to, say, update the entire table to change a single character field, you would basically double the size of the table. A vacuum will mark the space that is no longer used as available for reuse, so if you run regular vacuums you should achieve a fairly static size.

One thing to consider: since this is how PostgreSQL achieves MVCC, you may want to avoid updates to a table that has huge record sizes. For example, say you wanted to store a "last viewed date" for each file. If you had that date in the same table with your data, every view would add considerable "bloat" to your table. If, instead, your file is in a table by itself along with a sequence or other unique key, then the "last viewed date" or other changing data could be put into a separate table linked to the file storage table by that unique ID. Updating the date or other info about the file would only increase the bloat in the other table, and that bloat would be small. Performance would probably improve as well due to smaller file sizes and less read/write action on the disks.

Cheers,
Steve
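A rough sketch of the kind of split Steve describes (table and column names here are invented for illustration): keep the bulky, rarely-changed data in one table and the frequently-updated bookkeeping in a narrow side table, so each update only rewrites a small row.

-- large, rarely-updated payload
CREATE TABLE file_storage (
    file_id   serial PRIMARY KEY,
    file_data bytea            -- stand-in for whatever the bulky columns are
);

-- small, frequently-updated bookkeeping, linked by the unique ID
CREATE TABLE file_activity (
    file_id     integer PRIMARY KEY REFERENCES file_storage (file_id),
    last_viewed timestamp
);

-- only the narrow file_activity row is rewritten on each view
UPDATE file_activity SET last_viewed = now() WHERE file_id = 42;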
"Tony and Bryn Reina" <reina_ga@hotmail.com> writes: > CREATE TABLE SegmentValues ( > dbIndex integer REFERENCES EntityFile (dbIndex), > dwunitid smallint, > dwsampleindex smallint, > dtimestamp float4, > dvalue float4, > PRIMARY KEY (dbIndex, dtimestamp, dwsampleindex, dwunitid) > ); > I suppose one thing the binary flat file may be doing is not including the > time stamp in table SegmentValues. Since I know the sampling rate, I can > just calculate the timestamp on the fly by the rate times the index > (assuming no time offset). That would lose a float4 field, but would add > back a smallint field to the table. That won't buy you anything at all --- the two bytes saved would be lost again to alignment padding. (I'm assuming you're on PC hardware with MAXALIGN = 4 bytes.) I don't see orders-of-magnitude bloat here though. You've got 16 bytes of useful data per row (which I suppose was 12 bytes in the flat file?). There will be 28 bytes of overhead per table row. In addition the index will require 12 data bytes + 12 overhead bytes per entry; allowing for the fact that b-tree only likes to pack pages about 2/3ds full, we could estimate index size as about 36 bytes per original row, giving an aggregate bloat factor of 6.67X compared to a binary flat file if the flat file needed 12 bytes per row. The only way I could see to get to a 65X bloat factor would be if you'd repeatedly updated the table rows without vacuuming. regards, tom lane
> I don't see orders-of-magnitude bloat here though. [...] The only way
> I could see to get to a 65X bloat factor would be if you'd repeatedly
> updated the table rows without vacuuming.

Thanks Tom (and everyone else). At least I feel more comfortable that there's not something obvious. I did perform a 'vacuum full', but still no dice. At least thanks to Stephan I've narrowed my search down to one table, and I have a ballpark theoretical bloat to shoot for (6.67X). I'm going to go back and try to determine if I'm correctly interpreting the part of the flat file that holds this table's data.

-Tony
I thought I'd make a followup to the question of storage bloat. I tried approaching the problem by playing a little loosely with database normalization and using arrays instead of flattening everything out in the tables. So my original table,

CREATE TABLE original (
    dbIndex integer,
    index1 smallint,
    index2 smallint,
    index3 smallint,
    dvalue real
);

becomes

CREATE TABLE newtable (
    dbIndex integer,
    dvalue real[][][]
);

Now I'd expect to get better storage here just because I'm dropping 3 smallint fields (2 bytes x 3 for each value, or roughly a 6-fold decrease). However, in actuality I'm seeing a 30-fold decrease:

original table = 48 MB
newtable       = 1.6 MB !!!!!

If I'm understanding the data directory correctly, the array tuples are being TOASTed (which I think must compress them). The actual table in the newtable format is only 8K, and the pg_toast relation that goes with this table is 1.6 MB. So I'm very pleased with the storage here. Plus, reads are faster than with the original design.

All in all, I think I've found a good solution for my kind of data. Now I know that the use of arrays is considered taboo in the books, but in my case the data being stored actually do fall into an array (matrix) structure naturally (it's a time-based recording, so there's a sample for each time point and each X,Y position -- a natural 3D matrix). What are the possible downsides that I may be missing?
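For what it's worth, a minimal sketch of how the array variant gets used (the values here are made up): elements come back out by subscript, so the three index columns never need to be stored per sample.

-- store one recording's samples as a 3-D matrix keyed by dbIndex
INSERT INTO newtable (dbIndex, dvalue)
VALUES (1, '{{{0.1, 0.2}, {0.3, 0.4}}, {{0.5, 0.6}, {0.7, 0.8}}}');

-- pull a single sample back out by its (1-based) subscripts
SELECT dvalue[2][1][2] FROM newtable WHERE dbIndex = 1;   -- returns 0.6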