Thread: High insert rate from syslog
Hello all,

I have a Perl script that loads an entire day of Squid logs into a Postgres
table. I run it at midnight from a cron job and turn off the indexes before
the load (turning them back on afterwards). The script works fine, but I want
to change to a different approach.

I'd like to insert the log lines on the fly, as they are generated, so the
data is available online. But the table has some indexes and the volume is
about 300,000 lines/day, an average of about 3.48 inserts/sec. I think this
could overload the database server (I have not tested it yet), so I am
considering an unindexed table to receive the online inserts and a job that
moves all the lines into the main indexed table at midnight.

My question is: is there a better solution, or is this tactic a good way to
do it?

--
Valter Douglas Lisbôa Jr.
Sócio-Diretor
Trenix - IT Solutions
"Nossas Idéias, suas Soluções!"
www.trenix.com.br
contato@trenix.com.br
Tel. +55 19 3402.2957
Cel. +55 19 9183.4244
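[Editor's note: for illustration, a minimal sketch of the nightly approach
described above, in the same spirit as the Perl loader. The table, index,
and column names and the log parsing are assumptions; the real script and
schema are not shown in the thread.]

#!/usr/bin/perl
# Sketch of the nightly bulk load: turn the index "off" by dropping it, load
# the whole day in one transaction, then recreate it.  Table, index, column
# names, and the log parsing are illustrative assumptions.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=squidlogs', 'loguser', 'secret',
                       { AutoCommit => 0, RaiseError => 1 });

# Drop the index so the bulk insert pays no index-maintenance cost; commit
# right away so the exclusive lock is not held during the long load.
$dbh->do('DROP INDEX IF EXISTS squid_access_log_ts_idx');
$dbh->commit;

my $ins = $dbh->prepare(
    'INSERT INTO squid_access_log (ts, client_ip, result, bytes, url)
     VALUES (?, ?, ?, ?, ?)');

open my $log, '<', '/var/log/squid/access.log.yesterday' or die $!;
while (defined(my $line = <$log>)) {
    chomp $line;
    # Hypothetical parse of Squid's native access.log fields.
    my ($ts, $ip, $result, $bytes, $url) = (split /\s+/, $line)[0, 2, 3, 4, 6];
    $ins->execute($ts, $ip, $result, $bytes, $url);
}
close $log;
$dbh->commit;    # a single commit for the whole day's rows

# Recreate the index once, after all rows are in place.
$dbh->do('CREATE INDEX squid_access_log_ts_idx ON squid_access_log (ts)');
$dbh->commit;
$dbh->disconnect;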
On Thu, 2008-07-03 at 12:05 -0300, Valter Douglas Lisbôa Jr. wrote:
> Hello all, I have a Perl script that loads an entire day of Squid logs
> into a Postgres table. I run it at midnight from a cron job and turn off
> the indexes before the load (turning them back on afterwards). The script
> works fine, but I want to change to a different approach.

Exactly how do you turn off the indexes?

> I'd like to insert the log lines on the fly, as they are generated, so the
> data is available online. But the table has some indexes and the volume is
> about 300,000 lines/day, an average of about 3.48 inserts/sec. I think this
> could overload the database server (I have not tested it yet), so I am
> considering an unindexed table to receive the online inserts and a job
> that moves all the lines into the main indexed table at midnight.
>
> My question is: is there a better solution, or is this tactic a good way
> to do it?

300,000 inserts a day isn't that much for any reasonable server.

Sincerely,

Joshua D. Drake
Valter Douglas Lisbôa Jr. wrote:
> Hello all, I have a Perl script that loads an entire day of Squid logs
> into a Postgres table. I run it at midnight from a cron job and turn off
> the indexes before the load (turning them back on afterwards). The script
> works fine, but I want to change to a different approach.
>
> I'd like to insert the log lines on the fly, as they are generated, so the
> data is available online. But the table has some indexes and the volume is
> about 300,000 lines/day, an average of about 3.48 inserts/sec. I think this
> could overload the database server (I have not tested it yet), so I am
> considering an unindexed table to receive the online inserts and a job
> that moves all the lines into the main indexed table at midnight.

There are two things to bear in mind.

1. What you need to worry about is the peak rate of inserts, not the average.
   Even at 30 rows/sec that's not too bad.
2. What will your system do if the database is taken offline for a period?
   How will it catch up?

The limiting factor will be the speed of your disks. Assuming a single disk
(no battery-backed RAID cache) you'll be limited by the drive's rotation
speed (e.g. a 10,000 RPM disk gives at most about 10,000 commits/minute).
That will fall off rapidly if you only have one disk and it's busy doing
other reads/writes. But if you batch many log lines together you need far
fewer commits.

So, to address both points above, I'd use a script with a flexible batch
size (a sketch follows this message):

1. Estimate how many log lines need to be saved to the database.
2. Batch together a suitable number of lines (1-1000) and commit them to the
   database.
3. Sleep 1-10 seconds.
4. Go back to step 1, disconnecting and reconnecting every once in a while.

If the database is unavailable for any reason, this script will automatically
feed rows in faster when it comes back.

> My question is: is there a better solution, or is this tactic a good way
> to do it?

You might want to partition the table monthly. That will make it easier to
manage a few years from now.
http://www.postgresql.org/docs/current/static/ddl-partitioning.html

Also consider increasing checkpoint_segments if you find the system gets
backed up. Perhaps consider setting synchronous_commit to off (but only for
the connection saving the log lines to the database).
http://www.postgresql.org/docs/8.3/static/runtime-config-wal.html

--
Richard Huxton
Archonet Ltd
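[Editor's note: a rough sketch of that flexible-batch loop, again in
Perl/DBI with assumed table, column, and file names. It tails the Squid
access log by remembering its file position, which also gives the catch-up
behaviour described above; log rotation is ignored for brevity, and
synchronous_commit requires PostgreSQL 8.3.]

#!/usr/bin/perl
# Sketch of the flexible-batch loader: gather up to $MAX_BATCH new Squid log
# lines, commit them as a single transaction, sleep, repeat, and reconnect
# every so often.  Table, columns, and log parsing are assumptions; log
# rotation is ignored for brevity.
use strict;
use warnings;
use DBI;
use Fcntl qw(SEEK_SET SEEK_CUR SEEK_END);

my $MAX_BATCH  = 1000;   # rows per commit
my $SLEEP_SECS = 5;      # pause between batches
my $BATCHES_PER_CONNECTION = 500;   # then disconnect and reconnect

open my $log, '<', '/var/log/squid/access.log' or die $!;
seek $log, 0, SEEK_END;  # only pick up lines written from now on

while (1) {
    my $dbh = eval {
        DBI->connect('dbi:Pg:dbname=squidlogs', 'loguser', 'secret',
                     { AutoCommit => 0, RaiseError => 1, PrintError => 0 });
    };
    if (!$dbh) {         # database unreachable: wait, then try again
        sleep 30;
        next;
    }

    # Optional, PostgreSQL 8.3+: much cheaper commits on this connection
    # only, at the cost of possibly losing the last moment of log lines
    # if the server crashes.
    $dbh->do('SET synchronous_commit = off');

    my $ins = $dbh->prepare(
        'INSERT INTO squid_access_log (ts, client_ip, result, bytes, url)
         VALUES (?, ?, ?, ?, ?)');

    for (1 .. $BATCHES_PER_CONNECTION) {
        my $rows = 0;
        while ($rows < $MAX_BATCH) {
            my $pos  = tell $log;
            my $line = <$log>;
            last unless defined $line;       # caught up with the log for now
            if ($line !~ /\n\z/) {           # last line only partly written;
                seek $log, $pos, SEEK_SET;   # rewind and pick it up next time
                last;
            }
            chomp $line;
            # Hypothetical parse of Squid's native access.log fields.
            my ($ts, $ip, $result, $bytes, $url) =
                (split /\s+/, $line)[0, 2, 3, 4, 6];
            $ins->execute($ts, $ip, $result, $bytes, $url);
            $rows++;
        }
        $dbh->commit if $rows;     # one commit per batch, not per row
        seek $log, 0, SEEK_CUR;    # clear EOF so newly appended lines are seen
        sleep $SLEEP_SECS;
    }

    $dbh->disconnect;              # and reconnect on the next pass
}

While the database is down the script just keeps retrying the connection;
unread log lines accumulate in the file and come through in larger batches
once it reconnects.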
On Thursday 03 July 2008 13:03:49 Joshua D. Drake wrote:
> On Thu, 2008-07-03 at 12:05 -0300, Valter Douglas Lisbôa Jr. wrote:
> > Hello all, I have a Perl script that loads an entire day of Squid logs
> > into a Postgres table. I run it at midnight from a cron job and turn off
> > the indexes before the load (turning them back on afterwards). The
> > script works fine, but I want to change to a different approach.
>
> Exactly how do you turn off the indexes?

DROP/CREATE

> > I'd like to insert the log lines on the fly, as they are generated, so
> > the data is available online. But the table has some indexes and the
> > volume is about 300,000 lines/day, an average of about 3.48 inserts/sec.
> > I think this could overload the database server (I have not tested it
> > yet), so I am considering an unindexed table to receive the online
> > inserts and a job that moves all the lines into the main indexed table
> > at midnight.
> >
> > My question is: is there a better solution, or is this tactic a good
> > way to do it?
>
> 300,000 inserts a day isn't that much for any reasonable server.
>
> Sincerely,
>
> Joshua D. Drake

Thanks

--
Valter Douglas Lisbôa Jr.
Sócio-Diretor
Trenix - IT Solutions
"Nossas Idéias, suas Soluções!"
www.trenix.com.br
contato@trenix.com.br
Tel. +55 19 3402.2957
Cel. +55 19 9183.4244
On Thursday 03 July 2008, Richard Huxton <dev@archonet.com> wrote:
> You might want to partition the table monthly. That will make it easier to
> manage a few years from now.
> http://www.postgresql.org/docs/current/static/ddl-partitioning.html

Definitely pay attention to this point ... it's pretty easy to stuff 300K
rows a day into a table but a lot more painful to manage the enormous table
that results from doing it for any length of time.

--
Alan
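[Editor's note: for concreteness, the layout the linked page describes in
the 8.x docs is inheritance-based: one child table per month with a CHECK
constraint, plus a rule or trigger that routes inserts from the parent. A
sketch of a script that creates one month's child; table and column names
are assumptions.]

#!/usr/bin/perl
# Sketch of monthly partitioning in the inheritance style the linked docs
# describe: a child table per month with a CHECK constraint, plus a rule
# that routes inserts aimed at the parent into the right child.  Table and
# column names are assumptions.
use strict;
use warnings;
use DBI;

my ($year, $month) = @ARGV;                # e.g. 2008 8 for August 2008
die "usage: $0 YEAR MONTH\n" unless $year && $month;

my $first = sprintf '%04d-%02d-01', $year, $month;
my ($ny, $nm) = $month == 12 ? ($year + 1, 1) : ($year, $month + 1);
my $next  = sprintf '%04d-%02d-01', $ny, $nm;
my $child = sprintf 'squid_access_log_y%04dm%02d', $year, $month;

my $dbh = DBI->connect('dbi:Pg:dbname=squidlogs', 'loguser', 'secret',
                       { AutoCommit => 1, RaiseError => 1 });

# The child inherits the parent's columns; the CHECK constraint lets the
# planner skip this partition when constraint_exclusion is on.
$dbh->do(qq{
    CREATE TABLE $child (
        CHECK (ts >= DATE '$first' AND ts < DATE '$next')
    ) INHERITS (squid_access_log)
});
$dbh->do(qq{CREATE INDEX ${child}_ts_idx ON $child (ts)});

# Route inserts on the parent for this month into the new child.
$dbh->do(qq{
    CREATE RULE ${child}_insert AS
    ON INSERT TO squid_access_log
    WHERE (NEW.ts >= DATE '$first' AND NEW.ts < DATE '$next')
    DO INSTEAD INSERT INTO $child VALUES (NEW.*)
});

$dbh->disconnect;

Run from cron shortly before each new month starts; retiring old data then
becomes a cheap DROP or TRUNCATE of one child table instead of a bulk DELETE.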
Valter Douglas Lisbôa Jr. wrote:
> Hello all, I have a Perl script that loads an entire day of Squid logs
> into a Postgres table. I run it at midnight from a cron job and turn off
> the indexes before the load (turning them back on afterwards). The script
> works fine, but I want to change to a different approach.
>
> I'd like to insert the log lines on the fly, as they are generated, so the
> data is available online. But the table has some indexes and the volume is
> about 300,000 lines/day, an average of about 3.48 inserts/sec. I think this
> could overload the database server (I have not tested it yet), so I am
> considering an unindexed table to receive the online inserts and a job
> that moves all the lines into the main indexed table at midnight.
>
> My question is: is there a better solution, or is this tactic a good way
> to do it?

The average matters less than the peak. Unless your traffic is spread evenly
24x7, your peak rate will be higher. If your logging is concentrated in an
8-hour workday, your average daytime rate will be closer to 10/second, with
peaks that are much higher. You might consider some form of buffering between
the Squid log and the database to avoid blocking.

Your current method has the advantage of moving the database workload to
off-hours. Instead of moving data, you might look into partitioning it. How
long do you keep your logs actively available in PostgreSQL? I know one
company that partitions their log data by month (a parent table with a child
table for each month). They keep 12 months of data live, so they rotate
through the child tables: at the start of a month, that month's table is
truncated. Modify as appropriate for your load - perhaps a partition
(child table) for each day, or a current-day child table that is migrated
into a main table nightly. Either way you can make it appear that the parent
table is an up-to-date, complete table. You will need to do some reading on
table partitioning if you go this route; pay special attention to the
requirements needed to optimize queries.

You might also want to check your stats tables to make sure the indexes you
currently maintain are actually used by your queries, and remove any that
are unnecessary to reduce index-maintenance overhead.

Another possible technique would be a nightly process that creates partial
indexes. One set of indexes would cover all data prior to midnight and the
other set all data after midnight. Depending on the nature of your
"real-time" vs. historical queries, these might even be different indexes.
You will have to tweak your queries to make use of them, but your live
inserts won't have to update the "historical" indexes. Warning: the date
constraint in the partial index must be static - you can't do something like
"...WHERE squidlog_timestamp > current_date...". Your nightly process will be
creating new indexes with a new date constraint (a sketch follows this
message). You might even be able to get away with having no indexes on the
current day's data and just recreating the historical indexes nightly
(similar to your no-index table with a nightly insert).

But don't try the above until you determine you have a problem. On modest
3-year-old non-dedicated hardware (also running file storage, rsync backups,
mail...) with basic SATA RAID 1, we are handling a similar load without
strain.

Cheers,
Steve
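[Editor's note: a sketch of that nightly partial-index rebuild, with assumed
table, column, and index names. It shows only the "historical" set, i.e. the
simpler variant mentioned at the end of the message, and the cutoff baked
into the WHERE clause is a literal date, as required.]

#!/usr/bin/perl
# Sketch of the nightly partial-index rebuild: create a "historical" index
# whose WHERE clause has a static cutoff date (partial-index predicates
# cannot use current_date), then drop the previous night's index.  Table,
# column, and index names are assumptions.
use strict;
use warnings;
use DBI;
use POSIX qw(strftime);

my $today     = strftime '%Y-%m-%d', localtime;      # today's static cutoff
(my $tag      = $today) =~ s/-//g;                    # e.g. 20080704
my $new_index = "squid_access_log_hist_${tag}_idx";

my $dbh = DBI->connect('dbi:Pg:dbname=squidlogs', 'loguser', 'secret',
                       { AutoCommit => 1, RaiseError => 1 });

# Remember last night's historical index (if any) so it can be dropped once
# the replacement exists.
my ($old_index) = $dbh->selectrow_array(q{
    SELECT indexname FROM pg_indexes
    WHERE tablename = 'squid_access_log'
      AND indexname LIKE 'squid_access_log_hist_%'
});

# The literal date is what makes this a legal partial index; queries must
# include a compatible condition (e.g. ts < DATE '...') for the planner to
# choose it.
$dbh->do(qq{
    CREATE INDEX $new_index ON squid_access_log (ts)
    WHERE ts < DATE '$today'
});

$dbh->do(qq{DROP INDEX $old_index})
    if $old_index and $old_index ne $new_index;

$dbh->disconnect;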