Thread: Streaming large data into postgres [WORM-like applications]
Here is the straight dope: one of the internal teams at my customer site is looking into MySQL and replacing its storage engine so that they can store large amounts of streamed data. The key here is that the data they are getting arrives as several thousand rows in an extremely short duration. They say that only MySQL gives them the ability to replace the storage engine, which, granted, is easier.

If I go with the statement that Postgres can basically do what they intend to do for handling large datasets, I need to prepare my talking points. The requirements are as follows:

1. A large volume of streamed rows, on the order of 50-100k rows per second. I was thinking that the rows could be stored into a file, the file copied into a temp table using COPY, and those rows then appended to the master table, dropping and recreating the index very lazily [during the first query hit or something like that] (see the sketch after this message). The table size can grow extremely large, so ideally it can be partitioned, either by range or by list.

2. Most of the streamed rows are very similar. Think syslog rows, where in most cases only the timestamp changes. If the data can be compressed, that will yield real savings in disk size.

The key issue here is that the ultimate data usage is Write Once Read Many, and in that sense I am looking for a very optimal solution for bulk writes and for maintaining indexes during bulk writes.

So with some intelligent design, it should be possible to use Postgres. Any help in preparing my talking points is appreciated.

Regards,
Dhaval
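A minimal sketch of the load path described in point 1, assuming a hypothetical narrow "events" table (all names and paths here are illustrative, not from the original thread):

  -- Hypothetical schema, abbreviated to three of the dozen-or-so columns.
  CREATE TABLE events (
      logged_at timestamptz NOT NULL,
      host      varchar(64),
      message   varchar(512)
  );

  -- Unindexed staging table: COPY into it avoids per-row INSERT overhead.
  CREATE TEMP TABLE events_stage (LIKE events);
  COPY events_stage FROM '/var/spool/streamer/batch-0001.dat';

  -- Append the whole batch to the master table in one statement.
  INSERT INTO events SELECT * FROM events_stage;
  TRUNCATE events_stage;

  -- "Lazy" indexing: build the index only when read queries are imminent.
  CREATE INDEX events_logged_at_idx ON events (logged_at);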
Inserting 50,000 rows a second is, uh... difficult to do, no matter what database you're using. You'll probably have to spool the inserts and insert them as fast as you can, and just hope you don't fall too far behind.

But I suspect that you aren't going to be doing much, if any, referential integrity checking, at least beyond basic type checking. You probably aren't going to care about multiple inserts affecting each other, or worry about corruption if a given insert fails... in fact, you probably aren't going to need transactions at all, other than as a way to insert faster. Is SQL the right tool for you?

On May 11, 2007, at 1:43 PM, Dhaval Shah wrote:
> [quoted text trimmed]
I do care about the following:

1. Basic type checking
2. Knowing about failed inserts
3. Non-corruption
4. Macro transactions, that is, minimal read consistency

The following is not necessary:

1. Referential integrity

In this particular scenario:

1. There is a sustained load and there are peak loads. As long as we can handle the peak loads, the sustained load can be half of the quoted figure.
2. The rows have a limited number of columns: each spans at most a dozen or so, mostly integer or varchar.

It is more data-I/O heavy than CPU heavy.

Regards,
Dhaval

On 5/11/07, Ben <bench@silentmedia.com> wrote:
> [quoted text trimmed]
One approach would be to spool all the data to a flat file and then pull it into the database as you are able to. This would give you extremely high peak capability. (A sketch of draining such a spool follows below.)

On May 11, 2007, at 10:35 PM, Dhaval Shah wrote:
> [quoted text trimmed]
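The spool-and-drain approach also allows batching several spool files into a single transaction, which amortizes the per-commit fsync cost. A sketch, with hypothetical file names, reusing the "events" table from the earlier sketch:

  BEGIN;
  COPY events FROM '/var/spool/streamer/batch-0002.dat';
  COPY events FROM '/var/spool/streamer/batch-0003.dat';
  COPY events FROM '/var/spool/streamer/batch-0004.dat';
  COMMIT;

Note that server-side COPY reads a path visible to the backend and requires superuser rights; from a client, psql's \copy does the same work without either requirement.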
On 05/11/07 21:35, Dhaval Shah wrote:
> [quoted text trimmed]

Have you tested PG (and MySQL, for that matter) to determine what kind of load they can handle on the existing h/w?

Back to the original post: 100K inserts/second is 360 *million* inserts per hour. That's a *lot*. Even if the steady state is 50K inserts/sec, that's 180M inserts/hr.

If each record is 120 bytes, that's 43 gigabytes per hour, or 12MB/second. No problem from a h/w standpoint, but it will fill a 300GB HDD in 7 hours.

--
Ron Johnson, Jr.
Jefferson LA USA
At 04:43 AM 5/12/2007, Dhaval Shah wrote:
> [quoted text trimmed]

Is it one process inserting, or can it be many processes?

Is it just a (relatively) short high burst, or is that rate sustained for a long time? If it's sustained, I don't see the point of doing so many copies.

How many bytes per row? If the rate is sustained and the rows are big, then you are going to need LOTS of disks (e.g. a large RAID10).

When do you need to do the reads, and how up to date do they need to be?

Regards,
Link.
Consolidating my responses in one email.

1. The total data expected is some 1-1.5 TB a day. 75% of the data comes in a period of 10 hours; the remaining 25% comes in the other 14 hours. Of course there are ways to smooth the load pattern, but the current scenario is as explained.

2. I do expect the customer to roll in something like a NAS/SAN with terabytes of disk space. The idea is to retain the data for a duration and then offload it to tape.

That leads to the question: can the data be compressed? Since the data is very similar, compression should yield some 6x-10x reduction. Is there a way to identify which partitions are in which data files, and to compress them until they are actually read?

Regards,
Dhaval

On 5/12/07, Lincoln Yeoh <lyeoh@pop.jaring.my> wrote:
> [quoted text trimmed]
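On identifying which partitions sit in which data files: the mapping can be read from the system catalogs, as sketched here (partition names are hypothetical, matching the partitioning example further down). Note that Postgres cannot read a compressed data file, so compressing one is only safe for a partition that will not be queried again until it has been decompressed:

  -- The database's directory under $PGDATA/base/
  SELECT oid FROM pg_database WHERE datname = current_database();

  -- Each partition's file name within that directory
  SELECT relname, relfilenode
  FROM pg_class
  WHERE relname LIKE 'events_2007_%' AND relkind = 'r';

  -- A partition's data file is then $PGDATA/base/<database-oid>/<relfilenode>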
At 8:49p on 12 May 2007, Dhaval Shah wrote:
> That leads to the question: can the data be compressed? [...]

There was a very interesting article in ;login: magazine in April of this year discussing how they dealt with an exorbitant amount of largely similar data. The article claimed that through aggregation and gzip compression, they were able to reduce what they needed to store by roughly 350x, or about 0.7 bytes per 'event'. The article is:

  The Secret Lives of Computers Exposed: Flight Data Recorder for Windows
  by Chad Verbowski

You might try to get your mitts on that article for some ideas. I'm not sure you could apply any of their ideas directly to the Postgres backend data files, but perhaps somewhere in your pipeline.

Kevin
On 05/12/07 19:49, Dhaval Shah wrote:
> [quoted text trimmed]

45TB per month???? Wow.

The archival process *must* be considered when designing the system. PostgreSQL's ability to use tablespaces and partitioned tables will make that much easier. Otherwise, you'd have to be deleting from one "side" of the table while inserting into the other "side".

Partitioning will also let you divide the table into multiple "active" segments, so that multiple inserters can run simultaneously without stepping on each other, while spreading the load across multiple controllers and RAID sets. (A sketch follows below.)

If it's a SAN/NAS that is organized into RAID-5 groups, make *sure* that it has *lots* of battery-backed write-back cache.

Regarding compression: if the columns are integers or short VARCHAR fields, I do not see how compression can help you unless you use block-layer compression, which Linux doesn't do. Does FreeBSD have block-level compression?

--
Ron Johnson, Jr.
Jefferson LA USA
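A sketch of that layout using inheritance-based partitioning (the mechanism available in 8.x-era PostgreSQL), with hypothetical names: each day's partition lives in its own tablespace, loaders COPY into the current child directly, and archival is a cheap DROP rather than a mass DELETE:

  CREATE TABLE events_2007_05_12 (
      CHECK (logged_at >= '2007-05-12' AND logged_at < '2007-05-13')
  ) INHERITS (events) TABLESPACE raid_a;

  CREATE TABLE events_2007_05_13 (
      CHECK (logged_at >= '2007-05-13' AND logged_at < '2007-05-14')
  ) INHERITS (events) TABLESPACE raid_b;

  -- Loaders target the current child, not the parent, so several can
  -- run at once without stepping on each other.
  COPY events_2007_05_13 FROM '/var/spool/streamer/batch-0005.dat';

  -- Once a partition has been dumped to tape, dropping it is a
  -- catalog-only operation:
  DROP TABLE events_2007_05_12;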
Dhaval Shah wrote:
> 2. Most of the streamed rows are very similar. Think syslog rows,
> where for most cases only the timestamp changes. [...]

If it really is usually just the timestamp that changes, one way to "compress" such data might be to split your logical row into two tables. The first table has all the original columns except the timestamp, plus an ID. The second table has the timestamp and a foreign key into the first table. Depending on how wide your original row is, and how often it's only the timestamp that changes, this could result in decent "compression". (A sketch follows below.)

Of course, now you need referential integrity.

- John D. Burger
  MITRE
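A minimal sketch of that two-table split, with hypothetical names: the wide, repetitive part of each row is stored once, and each occurrence costs only a timestamp plus an integer key:

  CREATE TABLE log_template (
      id      serial PRIMARY KEY,
      host    varchar(64),
      message varchar(512)
  );

  CREATE TABLE log_occurrence (
      template_id integer NOT NULL REFERENCES log_template(id),
      logged_at   timestamptz NOT NULL
  );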
John D. Burger wrote:
> If it really is usually just the timestamp that changes, one way to
> "compress" such data might be to split your logical row into two
> tables. [...]

I thought of something similar. Maybe you could put those timestamps in an array column; that saves you a referential integrity check that you don't seem to need very much.

OTOH, _if_ your log messages(?) look very similar each time, you may be able to turn the problem around: you store unique log messages, with the timestamps at which they occurred. That way you rarely need to store more than a timestamp. It adds time to look up the matching log message (though there will be far fewer of them). I'm quite confident you'll save time inserting records this way, although that's hand-waving at this point. (A sketch follows below.)

You may be able to parameterize some log messages and store the parameters with the timestamps. That takes a bit more processing though (regular expression matches, maybe?), and you'll want to know all the different log message permutations beforehand.

Regards,

--
Alban Hertroys
alban@magproductions.nl
magproductions b.v.
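A sketch of the array-column variant, with hypothetical names: each distinct message is stored once, and a new occurrence only appends a timestamp. Whether the repeated UPDATE beats plain inserts at these rates would need benchmarking, per the hand-waving caveat above:

  CREATE TABLE log_message (
      id      serial PRIMARY KEY,
      message varchar(512) UNIQUE,
      seen_at timestamptz[]
  );

  -- Recording another occurrence of a known message:
  UPDATE log_message
  SET seen_at = array_append(seen_at, now())
  WHERE message = 'connection refused on port 5432';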