Thread: High Frequency Inserts to Postgres Database vs Writing to a File

High Frequency Inserts to Postgres Database vs Writing to a File

From
Jay Manni
Date:

Hi:

 

I have an application wherein a process needs to read data from a stream and store the records for further analysis and reporting. The data in the stream is in the form of variable length records with clearly defined fields – so it can be stored in a database or in a file. The only caveat is that the rate of records coming in the stream could be several 1000 records a second.

 

The design choice I am faced with currently is whether to use a postgres database or a flat file for this purpose. My application already maintains a postgres (8.3.4) database for other reasons – so it seemed like the straightforward thing to do. However I am concerned about the performance overhead of writing several 1000 records a second to the database. The same database is being used simultaneously for other activities as well and I do not want those to be adversely affected by this operation (especially the query times). The advantage of running complex queries to mine the data in various different ways is very appealing but the performance concerns are making me wonder if just using a flat file to store the data would be a better approach.

 

Anybody have any experience in high frequency writes to a postgres database?

 

- Jay

Re: High Frequency Inserts to Postgres Database vs Writing to a File

From
Scott Marlowe
Date:
On Tue, Nov 3, 2009 at 8:12 PM, Jay Manni <JManni@fireeye.com> wrote:
> Hi:
>
>
>
> I have an application wherein a process needs to read data from a stream and
> store the records for further analysis and reporting. The data in the stream
> is in the form of variable length records with clearly defined fields – so
> it can be stored in a database or in a file. The only caveat is that the
> rate of records coming in the stream could be several 1000 records a second.
>
>
>
> The design choice I am faced with currently is whether to use a postgres
> database or a flat file for this purpose. My application already maintains a

A common approach is to store them in flat files, then load the flat
files into the database at a later time, so that if the db falls
behind no data is lost.
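
A rough sketch of that pattern in Python with psycopg2 (the spool
path, table name, and columns here are made up for illustration):

import psycopg2

SPOOL = "/var/spool/myapp/stream.tsv"   # hypothetical spool file

def append_record(fields):
    # Hot path: the stream reader only appends a tab-delimited line.
    with open(SPOOL, "a") as f:
        f.write("\t".join(fields) + "\n")

def load_spool(conn):
    # Later (cron job or background thread): one COPY for the whole file.
    with open(SPOOL) as f:
        cur = conn.cursor()
        cur.copy_expert("COPY stream_log (ts, field1, field2) FROM STDIN", f)
        cur.close()
    conn.commit()
    # After a successful commit the spool file can be truncated or rotated.

# usage: load_spool(psycopg2.connect("dbname=mydb"))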

Re: High Frequency Inserts to Postgres Database vs Writing to a File

From
David Saracini
Date:
"could be several 1000 records a second."

So, are there periods when there are no/few records coming in?  Do the records/data/files really need to be persisted?  

The following statement makes me think you should go the flat file route:

"The advantage of running complex queries to mine the data in various different ways is very appealing"

Please don't be offended, but that sounds a little like feature creep.  I've found that it's best to keep it simple and not do a bunch of work now for something that might only be requested in the future.

I know it's not exactly what you were looking for...  Just food for thought.

Best of luck!

David



Re: High Frequency Inserts to Postgres Database vs Writing to a File

From
Merlin Moncure
Date:
On Tue, Nov 3, 2009 at 10:12 PM, Jay Manni <JManni@fireeye.com> wrote:
> Hi:
>
> I have an application wherein a process needs to read data from a stream and
> store the records for further analysis and reporting. The data in the stream
> is in the form of variable length records with clearly defined fields – so
> it can be stored in a database or in a file. The only caveat is that the
> rate of records coming in the stream could be several 1000 records a second.

How well Postgres handles this is going to depend primarily on two things:
*) Your hardware
*) The mechanism you use to insert the data into the database

Postgres can handle several thousand inserts/sec, but your hardware
most likely can't handle several thousand transactions/sec if fsync is
on.  You definitely want to batch the inserts into the database
somehow, so that something accumulates the data (it could be a simple
file) and flushes it into the database.  The flush should ideally use
COPY, but a multi-row insert is OK too.  Try to avoid inserting one
row at a time, even inside a transaction.
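
For example, something along these lines (a sketch only, again Python
with psycopg2 and a hypothetical stream_log table) accumulates rows in
memory and flushes each batch with a single COPY and a single commit:

import io
import psycopg2

BATCH_SIZE = 5000   # flush every 5000 rows; tune to taste

class BatchLoader:
    def __init__(self, conn):
        self.conn = conn
        self.rows = []

    def add(self, fields):
        self.rows.append("\t".join(fields))
        if len(self.rows) >= BATCH_SIZE:
            self.flush()

    def flush(self):
        if not self.rows:
            return
        data = io.StringIO("\n".join(self.rows) + "\n")
        cur = self.conn.cursor()
        cur.copy_expert("COPY stream_log (ts, field1, field2) FROM STDIN", data)
        cur.close()
        self.conn.commit()   # one commit per batch, not per row
        self.rows = []

# usage: loader = BatchLoader(psycopg2.connect("dbname=mydb"))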

If you are bulk inserting 1000+ records/sec all day long, make sure
you have provisioned enough storage for it (that's 86M records/day),
and you should immediately start thinking about partitioning and
rotating the log table (if you log to the database, partitioning and
rotation are basically baked in anyway).
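
A sketch of 8.3-style partition rotation using table inheritance (the
stream_log parent table and its naming convention are hypothetical):

import datetime
import psycopg2

def create_daily_partition(conn, day):
    # One child table per day, with a CHECK constraint so
    # constraint_exclusion can skip partitions that can't match a query.
    name = "stream_log_%s" % day.strftime("%Y%m%d")
    nxt = day + datetime.timedelta(days=1)
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE %s ("
        " CHECK (ts >= DATE '%s' AND ts < DATE '%s')"
        ") INHERITS (stream_log)" % (name, day.isoformat(), nxt.isoformat()))
    cur.execute("CREATE INDEX %s_ts_idx ON %s (ts)" % (name, name))
    cur.close()
    conn.commit()

# usage: create_daily_partition(psycopg2.connect("dbname=mydb"),
#                               datetime.date.today())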

The effects on other users of the database are really hard to predict
-- it's going to depend on how many resources you have (CPU and
especially disk) to direct towards the loading and on how the database
is being used.  I expect it shouldn't be too bad unless your database
is already I/O loaded.  The good news is that testing this is
relatively easy: you can simulate the load and run it during typical
use to see how it affects other users.  Standard O/S tools (iostat,
top) and the database log with log_min_duration_statement are going to
be a big help here.  If you start seeing big leaps in iowait
corresponding with unexpectedly lagging queries in your app, you
should probably think about scrapping the idea.

merlin

Re: High Frequency Inserts to Postgres Database vs Writing to a File

From
Jeff Janes
Date:
On Tue, Nov 3, 2009 at 7:12 PM, Jay Manni <JManni@fireeye.com> wrote:
> Hi:
>
>
>
> I have an application wherein a process needs to read data from a stream and
> store the records for further analysis and reporting.

Where is the stream coming from?  What happens if the process reading
the stream fails but the one generating the stream keeps going?

> The data in the stream
> is in the form of variable length records with clearly defined fields – so
> it can be stored in a database or in a file. The only caveat is that the
> rate of records coming in the stream could be several 1000 records a second.
>
> The design choice I am faced with currently is whether to use a postgres
> database or a flat file for this purpose. My application already maintains a
> postgres (8.3.4) database for other reasons – so it seemed like the
> straightforward thing to do. However I am concerned about the performance
> overhead of writing several 1000 records a second to the database. The same
> database is being used simultaneously for other activities as well and I do
> not want those to be adversely affected by this operation (especially the
> query times).

I would not use the database, just a flat file.  As long as you keep
the files around, you can always load them into a database later if a
compelling reason arises.

> The advantage of running complex queries to mine the data in
> various different ways is very appealing

Do you have concrete plans to do this, or just vague notions?

Even if the loading of 1000s of records per second doesn't adversely
impact the performance of other things going on in the server, surely
doing complex queries on hundreds of millions of records will.  How
long do you plan on storing the records in the database, and how will
you delete them?  Do you already know what indexes, if any, should be
on the table?

Jeff

Re: High Frequency Inserts to Postgres Database vs Writing to a File

From
Anj Adu
Date:
> I have an application wherein a process needs to read data from a stream and
> store the records for further analysis and reporting. The data in the stream
> is in the form of variable length records with clearly defined fields – so
> it can be stored in a database or in a file. The only caveat is that the
> rate of records coming in the stream could be several 1000 records a second.
> The design choice I am faced with currently is whether to use a postgres
> database or a flat file for this purpose. My application already maintains a
> postgres (8.3.4) database for other reasons – so it seemed like the
> straightforward thing to do. However I am concerned about the performance
> overhead of writing several 1000 records a second to the database. The same
> database is being used simultaneously for other activities as well and I do
> not want those to be adversely affected by this operation (especially the
> query times). The advantage of running complex queries to mine the data in
> various different ways is very appealing but the performance concerns are
> making me wonder if just using a flat file to store the data would be a
> better approach.
>
>
>
> Anybody have any experience in high frequency writes to a postgres database?


As mentioned earlier in this thread, make sure your hardware can
scale. You may hit a "monolithic hardware" wall and may have to
distribute your data across multiple boxes and have your application
manage the distribution and access. A RAID 10 storage architecture
(since fast writes are critical) on a multi-core box (preferably 8
cores) with fast SCSI disks (15K RPM) may be a good starting point.

We have a similar requirement and we scale by distributing the data
across multiple boxes. This is key.

If you need to run complex queries, plan on aggregation strategies
(processes that aggregate and optimize the data storage to facilitate
faster access).

Partitioning is key. You will need to purge old data at some point.
Without partitions, you will run into trouble with the time taken to
delete old data as well as with the availability of disk space.
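
For example (a sketch reusing the hypothetical stream_log_YYYYMMDD
partition naming from earlier in the thread), purging becomes a cheap
DROP TABLE per partition instead of a massive DELETE:

import datetime
import psycopg2

def purge_old_partitions(conn, keep_days=30):
    cutoff = datetime.date.today() - datetime.timedelta(days=keep_days)
    cur = conn.cursor()
    # List the child tables hanging off the stream_log parent.
    cur.execute("""
        SELECT c.relname
        FROM pg_class c
        JOIN pg_inherits i ON i.inhrelid = c.oid
        JOIN pg_class p ON p.oid = i.inhparent
        WHERE p.relname = 'stream_log'
    """)
    for (relname,) in cur.fetchall():
        day = datetime.datetime.strptime(relname[-8:], "%Y%m%d").date()
        if day < cutoff:
            # Dropping a partition is a quick catalog operation that frees
            # its disk space immediately; no dead tuples, no long DELETE.
            cur.execute("DROP TABLE %s" % relname)
    cur.close()
    conn.commit()

# usage: purge_old_partitions(psycopg2.connect("dbname=mydb"))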

These are just guidelines for a big warehouse style database.

Re: High Frequency Inserts to Postgres Database vs Writing to a File

From
Craig Ringer
Date:
Merlin Moncure wrote:

> Postgres can handle multiple 1000 insert/sec but your hardware most
> likely can't handle multiple 1000 transaction/sec if fsync is on.

commit_delay or async commit should help a lot there.

http://www.postgresql.org/docs/8.3/static/wal-async-commit.html
http://www.postgresql.org/docs/8.3/static/runtime-config-wal.html

Please do *not* turn fsync off unless you want to lose your data.
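
For instance, asynchronous commit can be enabled for just the loading
session, so every other session keeps full durability (a psycopg2
sketch; the connection string is a placeholder):

import psycopg2

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()
cur.execute("SET synchronous_commit TO off")   # affects this session only
# ... run the batched COPY / multi-row INSERT work here ...
cur.close()
conn.commit()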

> If you are bulk inserting 1000+ records/sec all day long, make sure
> you have provisioned enough storage for this (that's 86M records/day),

plus any index storage, room for dead tuples if you ever issue UPDATEs, etc.

--
Craig Ringer

Re: High Frequency Inserts to Postgres Database vs Writing to a File

From
Jay Manni
Date:
Thanks to all for the responses. Based on all the recommendations, I am going to try a batched-commit approach, along
with data purging policies so that the data storage does not grow beyond certain thresholds.

- J

