Thread: PG vs ElasticSearch for Logs
I want to store logs in a simple table.

Here are my columns:

Primary-key (auto generated)
timestamp
host
service-on-host
loglevel
msg
json (optional)

I am unsure which DB to choose: Postgres, ElasticSearch or ...?

We don't have high traffic. About 200k rows per day.

My heart beats for Postgres. We have been using it for several years.

On the other hand, the sentence "Don't store logs in a DB" is somewhere in my head.....

What do you think?

--
Thomas Guettler http://www.thomas-guettler.de/
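For reference, a minimal sketch of such a table in Postgres; the table name, the types and the jsonb choice for the optional JSON column are assumptions, not a fixed design:

CREATE TABLE applog (
    id        bigserial PRIMARY KEY,              -- auto-generated primary key
    ts        timestamptz NOT NULL DEFAULT now(), -- timestamp
    host      text NOT NULL,
    service   text NOT NULL,                      -- service-on-host
    loglevel  text NOT NULL,
    msg       text NOT NULL,
    js        jsonb                               -- optional structured part
);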
On 8/19/2016 12:32 AM, Thomas Güttler wrote:
> What do you think?

I store most of my logs in flat textfiles syslog style, and use grep for adhoc querying.

200K rows/day, thats 1.4 million/week, 6 million/month, pretty soon you're talking big tables.

in fact thats several rows/second on a 24/7 basis

--
john r pierce, recycling bits in santa cruz
On 19.08.2016 09:42, John R Pierce wrote:
> On 8/19/2016 12:32 AM, Thomas Güttler wrote:
>> What do you think?
>
> I store most of my logs in flat textfiles syslog style, and use grep for adhoc querying.
>
> 200K rows/day, thats 1.4 million/week, 6 million/month, pretty soon you're talking big tables.
>
> in fact thats several rows/second on a 24/7 basis

There is no need to store them more than 6 weeks in my current use case.

I think indexing in Postgres is much faster than grep. And queries including json data are not possible with grep (or at least very hard to type).

My concern is which DB (or indexing) to use ...

Regards,
Thomas

--
Thomas Guettler http://www.thomas-guettler.de/
On Fri, Aug 19, 2016 at 4:58 PM Thomas Güttler <guettliml@thomas-guettler.de> wrote:
On 19.08.2016 09:42, John R Pierce wrote:
> On 8/19/2016 12:32 AM, Thomas Güttler wrote:
>> What do you think?
>
> I store most of my logs in flat textfiles syslog style, and use grep for adhoc querying.
>
> 200K rows/day, thats 1.4 million/week, 6 million/month, pretty soon you're talking big tables.
>
> in fact thats several rows/second on a 24/7 basis
There is no need to store them more then 6 weeks in my current use case.
I think indexing in postgres is much faster than grep.
And queries including json data are not possible with grep (or at least very hard to type)
My concern is which DB (or indexing) to use ...
How will you be using the logs? What kind of queries? What kind of searches?
Correlating events and logs from various sources could be really easy with joins, count and summary operations.
The kind of volume you are anticipating should be fine with Postgres, but before you really decide which one, you need to figure out what you want to do with this data once it is in Postgres.
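As an illustration of the summary side of this (a sketch only, reusing the hypothetical applog table from above):

-- events per host and loglevel over the last day
SELECT host, loglevel, count(*) AS events
FROM applog
WHERE ts >= now() - interval '1 day'
GROUP BY host, loglevel
ORDER BY host, loglevel;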
Regards,
Thomas
--
Thomas Guettler http://www.thomas-guettler.de/
--
Best Regards
Sameer Kumar | DB Solution Architect
ASHNIK PTE. LTD.
101 Cecil Street, #11-11 Tong Eng Building, Singapore 069 533
T: +65 6438 3504 | M: +65 8110 0350
Skype: sameer.ashnik | www.ashnik.com
On 19.08.2016 11:21, Sameer Kumar wrote:
> On Fri, Aug 19, 2016 at 4:58 PM Thomas Güttler <guettliml@thomas-guettler.de> wrote:
>
>> On 19.08.2016 09:42, John R Pierce wrote:
>>> On 8/19/2016 12:32 AM, Thomas Güttler wrote:
>>>> What do you think?
>>>
>>> I store most of my logs in flat textfiles syslog style, and use grep for adhoc querying.
>>>
>>> 200K rows/day, thats 1.4 million/week, 6 million/month, pretty soon you're talking big tables.
>>>
>>> in fact thats several rows/second on a 24/7 basis
>>
>> There is no need to store them more then 6 weeks in my current use case.
>>
>> I think indexing in postgres is much faster than grep.
>> And queries including json data are not possible with grep (or at least very hard to type)
>>
>> My concern is which DB (or indexing) to use ...
>
> How will you be using the logs? What kind of queries? What kind of searches?
> Correlating events and logs from various sources could be really easy with joins, count and summary operations.

Wishes grow with possibilities. First I want to do simple queries about hosts and timestamps. Then some simple substring matches.

Up to now no structured logging (the json column) gets created. But if it gets filled, we will find use cases where we use ssh+grep up to now.

Up to now we need no stemming and language support.

> The kind of volume you are anticipating should be fine with Postgres but before you really decide which one, you need to
> figure out what would you want to do with this data once it is in Postgres.

The goal is a bit fuzzy up to now: better overview.

Thank you for your feedback ("The kind of volume you are anticipating should be fine with Postgres").

I guess I will use Postgres, especially since the Django ORM supports JSON in Postgres:
https://docs.djangoproject.com/en/1.10/ref/contrib/postgres/fields/#jsonfield

Regards,
Thomas Güttler

--
Thomas Guettler http://www.thomas-guettler.de/
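To give an idea of what a query on the json column could look like once it gets filled, a sketch assuming the hypothetical jsonb column js and a made-up key name:

-- log lines whose JSON payload carries status = 'error' (key and value are made up)
SELECT ts, host, msg
FROM applog
WHERE js ->> 'status' = 'error'
  AND ts >= now() - interval '6 weeks';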
On 19.08.2016 at 10:57, Thomas Güttler wrote:
> On 19.08.2016 09:42, John R Pierce wrote:
[-------------]
>> in fact thats several rows/second on a 24/7 basis
>
> There is no need to store them more then 6 weeks in my current use case.
>
> I think indexing in postgres is much faster than grep.

Not so much IMHE(xperience):

1. if you know what you are looking for: grep's the best
2. if you don't .... grep (or more/less/vi) is the best

Only when you routinely update/join/etc does an RDBMS really shine. But that's not what you normally do with your logs, right?

But then again, there is an additional benefit of "having everything under one hood" - so standardising on a single repository (like an RDBMS) has its benefits.

regards,
-R
Thomas Güttler <guettliml@thomas-guettler.de> wrote:
>> How will you be using the logs? What kind of queries? What kind of searches?
>> Correlating events and logs from various sources could be really easy with joins, count and summary operations.
>
> Wishes raise with possibilities. First I want to do simple queries about
> hosts and timestamps. Then some simple substring matches.

For append-only tables like this, consider 9.5 and BRIN indexes for timestamp searches. But if you delete after N weeks, BRIN may not work properly because of vacuum and re-use of space within the table. Do you know BRIN?

So, in your case, consider partitioning, maybe per month. That way you can also avoid a mess with table and index bloat.

Greetings from Dresden to Chemnitz (is this still valid?)

Regards, Andreas Kretschmer

--
Andreas Kretschmer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
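For reference, creating such an index is a one-liner (a sketch using the hypothetical names from earlier); a BRIN index stays tiny because it only stores one summary entry per range of table blocks:

CREATE INDEX applog_ts_brin ON applog USING brin (ts);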
On Fri, Aug 19, 2016 at 12:44 PM, Andreas Kretschmer <akretschmer@spamfence.net> wrote:
> for append-only tables like this consider 9.5 and BRIN-Indexes for
> timestamp-searches. But if you deletes after N weeks BRIN shouldn't work
> properly because of vacuum and re-use of space within the table.
> Do you know BRIN?
>
> So, in your case, consider partitioning, maybe per month. So you can
> also avoid mess with table and index bloat.

If done properly he can use both. For 6 weeks I would use seven partitions, current + 6 previous, and drop the oldest partition weekly, so effectively they become append-only and he can use BRIN too.

Even better, if he normally inserts in batches (it happens to me with some log-like data: I rotate the file and insert all rotated data periodically), he can use a staging table (1 master, and inheriting from it seven constrained week partitions plus one unconstrained staging partition). Insertions go into staging and are moved with a small delay to the corresponding partition, using an ordered select so they go in perfect order into their final resting place, and it can be vacuumed just after that (if they are log lines and the maximum delay is X, you just move every row older than that from staging to the partition, with whatever period is best).

The staging partition is normally small and cached and can be processed quite fast (with 200k/day an hourly movement will leave staging with less than about 10k rows if the distribution is somehow uniform).

Francisco Olarte.
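A sketch of that layout with inheritance partitioning (declarative partitioning did not exist yet in 9.5/9.6); all table names, week boundaries and the settle delay are assumptions:

-- one constrained child per week, inheriting from the hypothetical applog parent
CREATE TABLE applog_2016w34 (
    CHECK (ts >= '2016-08-22 00:00+00' AND ts < '2016-08-29 00:00+00')
) INHERITS (applog);

-- unconstrained staging child that receives all inserts
CREATE TABLE applog_staging () INHERITS (applog);

-- periodically move the settled rows of one week from staging into its
-- partition, ordered by timestamp so the partition stays physically sorted
WITH moved AS (
    DELETE FROM applog_staging
    WHERE ts >= '2016-08-22 00:00+00' AND ts < '2016-08-29 00:00+00'
      AND ts < now() - interval '1 hour'
    RETURNING *
)
INSERT INTO applog_2016w34
SELECT * FROM moved
ORDER BY ts;

-- dropping the oldest week is instant and leaves no bloat behind
DROP TABLE applog_2016w28;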
On 19.08.2016 12:44, Andreas Kretschmer wrote:
> Thomas Güttler <guettliml@thomas-guettler.de> wrote:
>
>>> How will you be using the logs? What kind of queries? What kind of searches?
>>> Correlating events and logs from various sources could be really easy with joins, count and summary operations.
>>
>> Wishes raise with possibilities. First I want to do simple queries about
>> hosts and timestamps. Then some simple substring matches.
>
> for append-only tables like this consider 9.5 and BRIN-Indexes for
> timestamp-searches. But if you deletes after N weeks BRIN shouldn't work
> properly because of vacuum and re-use of space within the table.
> Do you know BRIN?
>
> So, in your case, consider partitioning, maybe per month. So you can
> also avoid mess with table and index bloat.

Thank you very much for these hints. I did not know BRIN before.

> Greetings from Dresden to Chemnitz (is this still valid?)

Yes, I am in Chemnitz/Germany. Everything (kids, wife, friends, sports, job) is fine. I hope the same with you?

Regards,
Thomas Güttler

--
Thomas Guettler http://www.thomas-guettler.de/
On 8/19/2016 3:44 AM, Andreas Kretschmer wrote:
> So, in your case, consider partitioning, maybe per month. So you can
> also avoid mess with table and index bloat.

with his 6 week retention, i'd partition by week.

--
john r pierce, recycling bits in santa cruz
On Fri, Aug 19, 2016 at 2:32 AM, Thomas Güttler <guettliml@thomas-guettler.de> wrote:
> I want to store logs in a simple table.
>
> Here my columns:
>
> Primary-key (auto generated)
> timestamp
> host
> service-on-host
> loglevel
> msg
> json (optional)
>
> I am unsure which DB to choose: Postgres, ElasticSearch or ...?

We use SOLR (which is similar to ElasticSearch) here for json document retrieval. Agreeing to do this was one of the biggest mistakes in my professional career. This choice was somewhat forced because at the time jsonb was not baked. In my opinion, jsonb outclasses these types of services, particularly if you are already invested in postgres. The specifics of your requirements also play into this decision, naturally.

The bottom line though is that these kinds of systems are not nearly as fast or robust as they claim to be, particularly if you wander off the use cases they are engineered for (like needing transactions or joins, for example). They also tend to be fairly opaque in how they operate, and the supporting tooling is laughable relative to established database systems. Postgres OTOH can be made to do pretty much anything given sufficient expertise and a progressive attitude.

merlin
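To make the jsonb point concrete, a sketch of the index that typically backs such document queries, assuming the hypothetical js column from earlier (jsonb_path_ops indexes only support the containment operator, but are small and fast for it):

CREATE INDEX applog_js_gin ON applog USING gin (js jsonb_path_ops);

-- containment query that can use the GIN index; the key/value pair is made up
SELECT ts, host, service, msg
FROM applog
WHERE js @> '{"component": "billing"}';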
On 8/19/2016 2:32 AM, Thomas Güttler wrote:
> I want to store logs in a simple table.
>
> Here my columns:
>
> Primary-key (auto generated)
> timestamp
> host
> service-on-host
> loglevel
> msg
> json (optional)
>
> I am unsure which DB to choose: Postgres, ElasticSearch or ...?
>
> We don't have high traffic. About 200k rows per day.
>
> My heart beats for postgres. We use it since several years.
>
> On the other hand, the sentence "Don't store logs in a DB" is
> somewhere in my head.....
>
> What do you think?

I played with ElasticSearch a little, mostly because I wanted to use Kibana, which looks really pretty. I dumped a ton of logs into it, and made a pretty dashboard ... but in the end it didn't really help me, and wasn't that useful. My problem is, I don't want to have to go look at it. If something goes bad, then I want an email alert, at which point I'm going to go run top, and tail the logs.

Another problem I had with kibana/ES is the syntax to search stuff is different than I'm used to. It made it hard to find stuff in kibana.

Right now, I have a perl script that reads apache logs and fires off updates into PG to keep stats. But it's an hourly summary, which the website turns around and queries to show pretty usage graphs.

In the end, PG or ES, all depends on what you want.

-Andy
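For what it's worth, that kind of hourly roll-up is something plain SQL does well; a sketch with an entirely made-up access_log table (ts timestamptz, status int):

SELECT date_trunc('hour', ts) AS hour,
       count(*) AS hits,
       count(*) FILTER (WHERE status >= 500) AS server_errors
FROM access_log
WHERE ts >= now() - interval '7 days'
GROUP BY 1
ORDER BY 1;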
On 19/08/16 10:57, Thomas Güttler wrote:
>>> What do you think?
>>
>> I store most of my logs in flat textfiles syslog style, and use grep for adhoc querying.
>>
>> 200K rows/day, thats 1.4 million/week, 6 million/month, pretty soon you're talking big tables.
>>
>> in fact thats several rows/second on a 24/7 basis
>
> There is no need to store them more then 6 weeks in my current use case.

Hi,

to me this kind of data looks like something Postgres can handle with ease.

We're talking about 8.4M rows here.

Coincidentally, I was trying out the new parallel query feature in the 9.6 beta just now and decided to use your numbers as a test case :)

I can create 8.4M records having a timestamp and a random ~ 250 character string in ~ 31 seconds:

pg96=# select now() + (sec / 200000.0 * 86400.0 || ' seconds')::interval as ts,
pg96-#        repeat(random()::text, 15) as msg
pg96-# into t1
pg96-# from generate_series(1, 6 * 7 * 200000) as sec;
SELECT 8400000
Time: 30858.274 ms

Table size is 2.4 GB. This gives about 6 weeks.

A query to scan the whole thing on the narrow column takes ~ 400 msec, like this:

pg96=# select min(ts), max(ts) from t1;
              min              |              max
-------------------------------+-------------------------------
 2016-08-19 20:17:24.921333+00 | 2016-09-30 20:17:24.489333+00
(1 row)

Time: 409.468 ms

Even running an unanchored regular expression (!) on the wider column is doable:

pg96=# select count(*) from t1 where msg ~ '12345';
 count
-------
   955
(1 row)

Time: 3146.838 ms

If you have some filter, not everything needs to be regexped and this gets pretty fast:

pg96=# select count(*) from t1 where ts between '2016-08-25' and '2016-08-26' and msg ~ '12345';
 count
-------
    24
(1 row)

Time: 391.577 ms

All this is without indices. Your data is more structured than my test, so undoubtedly you will get some gain from indices...

Here is something more analytical - basically the same as the count(*) above:

pg96=# select ts::date, count(*) from t1 where msg ~ '12345' group by ts::date order by ts::date;
     ts     | count
------------+-------
 2016-08-19 |    26
 2016-08-20 |    28
 [...]
 2016-09-28 |    21
 2016-09-29 |    33
(42 rows)

Time: 3157.010 ms

Note, however, that I'm using the 9.6 beta with the parallel query feature: the sequential scans with the regexp are run in parallel on 6 workers... this gives me a speed-up of a factor 4-5 (the machine has 8 logical CPUs) and the whole table fits in cache. For a use case like this, the parallel query feature in 9.6 is so good it's almost like cheating ;)

Bye,
Chris.
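If indices are wanted for exactly this workload, two obvious candidates (a sketch on top of the t1 test table above, not benchmarked here) are a BRIN index on the timestamp and, for the unanchored pattern searches, a trigram GIN index on the message column:

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX t1_ts_brin  ON t1 USING brin (ts);
CREATE INDEX t1_msg_trgm ON t1 USING gin (msg gin_trgm_ops);

-- the substring/regexp searches above can then use the trigram index
SELECT count(*) FROM t1 WHERE msg ~ '12345';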
On Sat, 20 Aug 2016, 2:00 a.m. Andy Colson, <andy@squeakycode.net> wrote:
On 8/19/2016 2:32 AM, Thomas Güttler wrote:
> I want to store logs in a simple table.
>
> Here my columns:
>
> Primary-key (auto generated)
> timestamp
> host
> service-on-host
> loglevel
> msg
> json (optional)
>
> I am unsure which DB to choose: Postgres, ElasticSearch or ...?
>
> We don't have high traffic. About 200k rows per day.
>
> My heart beats for postgres. We use it since several years.
>
> On the other hand, the sentence "Don't store logs in a DB" is
> somewhere in my head.....
>
> What do you think?
>
>
>
I played with ElasticSearch a little, mostly because I wanted to use
Kibana which looks really pretty. I dumped a ton of logs into it, and
made a pretty dashboard ... but in the end it didn't really help me, and
wasn't that useful. My problem is, I don't want to have to go look at
it. If something goes bad, then I want an email alert, at which point
I'm going to go run top, and tail the logs.
There are tools in the Elastic Stack which could have helped you with email alerts and with gathering top output or tailing log files.
Another problem I had with kibana/ES is the syntax to search stuff is
different than I'm used to. It made it hard to find stuff in kibana.
Right now, I have a perl script that reads apache logs and fires off
updates into PG to keep stats. But its an hourly summary, which the
website turns around and queries the stats to show pretty usage graphs.
In the end, PG or ES, all depends on what you want.
-Andy
--
Best Regards
Sameer Kumar | DB Solution Architect
ASHNIK PTE. LTD.
101 Cecil Street, #11-11 Tong Eng Building, Singapore 069 533
T: +65 6438 3504 | M: +65 8110 0350
Skype: sameer.ashnik | www.ashnik.com
Thank you Chris for looking at my issue in such detail. Yes, the parallel feature rocks.

Regards,
Thomas Güttler

On 19.08.2016 22:40, Chris Mair wrote:
> On 19/08/16 10:57, Thomas Güttler wrote:
>
>>>> What do you think?
>>>
>>> I store most of my logs in flat textfiles syslog style, and use grep for adhoc querying.
>>>
>>> 200K rows/day, thats 1.4 million/week, 6 million/month, pretty soon you're talking big tables.
>>>
>>> in fact thats several rows/second on a 24/7 basis
>>
>> There is no need to store them more then 6 weeks in my current use case.
>
> Hi,
>
> to me this kind of data looks like something Postgres can handle with ease.
>
> We're talking about 8.4M rows here.
>
> Coincidentally, I was trying out the new parallel query feature in the
> 9.6 beta just now and decided to use your numbers as a test case :)
>
> I can create 8.4M records having a timestamp and a random ~ 250 character string
> in ~ 31 seconds:
>
> pg96=# select now() + (sec / 200000.0 * 86400.0 || ' seconds')::interval as ts,
> pg96-#        repeat(random()::text, 15) as msg
> pg96-# into t1
> pg96-# from generate_series(1, 6 * 7 * 200000) as sec;
> SELECT 8400000
> Time: 30858.274 ms
>
> Table size is 2.4 GB. This gives about 6 weeks.
>
> A query to scan the whole thing on the narrow column takes ~ 400 msec, like this:
>
> pg96=# select min(ts), max(ts) from t1;
>               min              |              max
> -------------------------------+-------------------------------
>  2016-08-19 20:17:24.921333+00 | 2016-09-30 20:17:24.489333+00
> (1 row)
>
> Time: 409.468 ms
>
> Even running an unanchored regular expression (!) on the wider column is doable:
>
> pg96=# select count(*) from t1 where msg ~ '12345';
>  count
> -------
>    955
> (1 row)
>
> Time: 3146.838 ms
>
> If you have some filter, not everything needs to be regexped and this gets pretty fast:
>
> pg96=# select count(*) from t1 where ts between '2016-08-25' and '2016-08-26' and msg ~ '12345';
>  count
> -------
>     24
> (1 row)
>
> Time: 391.577 ms
>
> All this is without indices. Your data is more structured than my test, so undoubtly you will
> get some gain from indices...
>
> Here is something more analytical - basically same as the count(*) above:
>
> pg96=# select ts::date, count(*) from t1 where msg ~ '12345' group by ts::date order by ts::date;
>      ts     | count
> ------------+-------
>  2016-08-19 |    26
>  2016-08-20 |    28
>  [...]
>  2016-09-28 |    21
>  2016-09-29 |    33
> (42 rows)
>
> Time: 3157.010 ms
>
> Note, however, that I'm using 9.6 beta with the parallel query feature: the sequential scans with the regexp is
> run in parallel on 6 workers... this gives me a speed-up of a factor 4-5 (machine has 8 logical CPUs) and
> the whole table fits in cache. For a use case as this, the parallel query feature in 9.6 is so good it's almost
> like cheating ;)
>
> Bye,
> Chris.

--
Thomas Guettler http://www.thomas-guettler.de/
On 19.08.2016 19:59, Andy Colson wrote:
> On 8/19/2016 2:32 AM, Thomas Güttler wrote:
>> I want to store logs in a simple table.
>>
>> Here my columns:
>>
>> Primary-key (auto generated)
>> timestamp
>> host
>> service-on-host
>> loglevel
>> msg
>> json (optional)
>>
>> I am unsure which DB to choose: Postgres, ElasticSearch or ...?
>>
>> We don't have high traffic. About 200k rows per day.
>>
>> My heart beats for postgres. We use it since several years.
>>
>> On the other hand, the sentence "Don't store logs in a DB" is
>> somewhere in my head.....
>>
>> What do you think?
>
> I played with ElasticSearch a little, mostly because I wanted to use Kibana which looks really pretty. I dumped a ton
> of logs into it, and made a pretty dashboard ... but in the end it didn't really help me, and wasn't that useful. My
> problem is, I don't want to have to go look at it. If something goes bad, then I want an email alert, at which point
> I'm going to go run top, and tail the logs.
>
> Another problem I had with kibana/ES is the syntax to search stuff is different than I'm used to. It made it hard to
> find stuff in kibana.
>
> Right now, I have a perl script that reads apache logs and fires off updates into PG to keep stats. But its an hourly
> summary, which the website turns around and queries the stats to show pretty usage graphs.

You use Perl to read apache logs. Does this work?

Forwarding logs reliably is not easy. Logs are streams, files in unix are not streams. Sooner or later the files get rotated. RELP exists, but AFAIK its usage is not widespread:

https://en.wikipedia.org/wiki/Reliable_Event_Logging_Protocol

Let's see how to get the logs into postgres ....

> In the end, PG or ES, all depends on what you want.

Most of my logs start from a http request. I want a unique id per request in every log line which gets created. This way I can trace the request, even if its impact spans several hosts and systems which do not receive http requests.

Regards,
Thomas Güttler

--
Thomas Guettler http://www.thomas-guettler.de/
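A sketch of what that trace could look like on the Postgres side, assuming a hypothetical request_id column added to the hypothetical log table and filled in by the applications:

-- hypothetical column carrying the per-request id
ALTER TABLE applog ADD COLUMN request_id text;

-- all log lines produced by one HTTP request, across hosts and services
SELECT ts, host, service, loglevel, msg
FROM applog
WHERE request_id = 'a3f1c2d4e5'   -- made-up id value
ORDER BY ts;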
On Mon, 22 Aug 2016, 3:40 p.m. Thomas Güttler, <guettliml@thomas-guettler.de> wrote:
On 19.08.2016 19:59, Andy Colson wrote:
> On 8/19/2016 2:32 AM, Thomas Güttler wrote:
>> I want to store logs in a simple table.
>>
>> Here my columns:
>>
>> Primary-key (auto generated)
>> timestamp
>> host
>> service-on-host
>> loglevel
>> msg
>> json (optional)
>>
>> I am unsure which DB to choose: Postgres, ElasticSearch or ...?
>>
>> We don't have high traffic. About 200k rows per day.
>>
>> My heart beats for postgres. We use it since several years.
>>
>> On the other hand, the sentence "Don't store logs in a DB" is
>> somewhere in my head.....
>>
>> What do you think?
>>
>>
>>
>
> I played with ElasticSearch a little, mostly because I wanted to use Kibana which looks really pretty. I dumped a ton
> of logs into it, and made a pretty dashboard ... but in the end it didn't really help me, and wasn't that useful. My
> problem is, I don't want to have to go look at it. If something goes bad, then I want an email alert, at which point
> I'm going to go run top, and tail the logs.
>
> Another problem I had with kibana/ES is the syntax to search stuff is different than I'm used to. It made it hard to
> find stuff in kibana.
>
> Right now, I have a perl script that reads apache logs and fires off updates into PG to keep stats. But its an hourly
> summary, which the website turns around and queries the stats to show pretty usage graphs.
You use Perl to read apache logs. Does this work?
Forwarding logs reliably is not easy. Logs are streams, files in unix are not streams. Sooner or later
the files get rotated. RELP exists, but AFAIK it's usage is not wide spread:
https://en.wikipedia.org/wiki/Reliable_Event_Logging_Protocol
Let's see how to get the logs into postgres ....
> In the end, PG or ES, all depends on what you want.
Most of my logs start from a http request. I want a unique id per request
in every log line which gets created. This way I can trace the request,
even if its impact spans to several hosts and systems which do not receive http requests.
You may decide not to use Elasticsearch but take a look at other components of Elastic Stack like logstash and beats. They can be helpful even when you use Postgres as the end point. Otherwise (IMHO), you would spend a lot of time writing scripts and jobs to capture and stream logs. If I were you, I would not want to do that.
Regards,
Thomas Güttler
--
Thomas Guettler http://www.thomas-guettler.de/
--
Best Regards
Sameer Kumar | DB Solution Architect
ASHNIK PTE. LTD.
101 Cecil Street, #11-11 Tong Eng Building, Singapore 069 533
T: +65 6438 3504 | M: +65 8110 0350
Skype: sameer.ashnik | www.ashnik.com
On 8/22/2016 2:39 AM, Thomas Güttler wrote:
> On 19.08.2016 19:59, Andy Colson wrote:
>> On 8/19/2016 2:32 AM, Thomas Güttler wrote:
>>> I want to store logs in a simple table.
>>>
>>> Here my columns:
>>>
>>> Primary-key (auto generated)
>>> timestamp
>>> host
>>> service-on-host
>>> loglevel
>>> msg
>>> json (optional)
>>>
>>> I am unsure which DB to choose: Postgres, ElasticSearch or ...?
>>>
>>> We don't have high traffic. About 200k rows per day.
>>>
>>> My heart beats for postgres. We use it since several years.
>>>
>>> On the other hand, the sentence "Don't store logs in a DB" is
>>> somewhere in my head.....
>>>
>>> What do you think?
>>
>> I played with ElasticSearch a little, mostly because I wanted to use
>> Kibana which looks really pretty. I dumped a ton
>> of logs into it, and made a pretty dashboard ... but in the end it
>> didn't really help me, and wasn't that useful. My
>> problem is, I don't want to have to go look at it. If something goes
>> bad, then I want an email alert, at which point
>> I'm going to go run top, and tail the logs.
>>
>> Another problem I had with kibana/ES is the syntax to search stuff is
>> different than I'm used to. It made it hard to
>> find stuff in kibana.
>>
>> Right now, I have a perl script that reads apache logs and fires off
>> updates into PG to keep stats. But its an hourly
>> summary, which the website turns around and queries the stats to show
>> pretty usage graphs.
>
> You use Perl to read apache logs. Does this work?
>
> Forwarding logs reliably is not easy. Logs are streams, files in unix
> are not streams. Sooner or later
> the files get rotated. RELP exists, but AFAIK it's usage is not wide
> spread:
>
> https://en.wikipedia.org/wiki/Reliable_Event_Logging_Protocol
>
> Let's see how to get the logs into postgres ....
>
>> In the end, PG or ES, all depends on what you want.
>
> Most of my logs start from a http request. I want a unique id per request
> in every log line which gets created. This way I can trace the request,
> even if its impact spans to several hosts and systems which do not
> receive http requests.
>
> Regards,
> Thomas Güttler

I don't read the file. In apache.conf:

# v, countyia, ip, sess, ts, url, query, status
LogFormat "3,%{countyName}e,%a,%{VCSID}C,%{%Y-%m-%dT%H:%M:%S%z}t,\"%U\",\"%q\",%>s" csv3

CustomLog "|/usr/local/bin/statSender.pl -r 127.0.0.1" csv3

I think I read somewhere that if you pipe to a script (like above) and you don't read fast enough, it could slow apache down. That's why the script above dumps to redis first. That way I can move processes around, restart the database, etc, etc, and not break apache in any way.

The important part of the script:

while (my $x = <>)
{
    chomp($x);
    next unless ($x);

try_again:
    if ($redis)
    {
        eval {
            $redis->lpush($qname, $x);
        };
        if ($@)
        {
            $redis = redis_connect();
            goto try_again;
        }
        # just silence this one
        eval {
            $redis->ltrim($qname, 0, 1000);
        };
    }
}

Any other machine, or even multiple, then reads from redis and inserts into PG.

You can see, in my script, I trim the queue to 1000 items, but that's because I'm not as worried about losing results. Your setup would probably be different. I also setup redis to not save anything to disk, again, because I don't mind if I lose a few hits here or there. But you get the idea.

-Andy
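On the Postgres side, the consumer that drains the queue could batch its inserts; a minimal sketch assuming the hypothetical applog table from earlier (COPY would also work for larger batches):

INSERT INTO applog (ts, host, service, loglevel, msg)
VALUES
    ('2016-08-22 07:03:01+00', 'web1', 'apache', 'INFO', 'GET /index.html 200'),
    ('2016-08-22 07:03:02+00', 'web1', 'apache', 'INFO', 'GET /favicon.ico 404');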
Certainly Postgres is capable of handling this volume just fine. Throw in some partition rotation handling and you have a solution.
If you want to play with something different, check out Graylog, which is backed by Elasticsearch. A bit more work to set up than a single Postgres table, but it has been a success for us storing syslog, app logs, and Postgres logs from several hundred network devices, Windows and Linux servers. Rotation is handled based on your requirements and drilling down to the details is trivial. Alerting is baked in as well. It could well be overkill for your needs, but I don't know what your environment looks like.
T
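The partition rotation mentioned above can be as little as two statements per week; a sketch reusing the hypothetical inheritance layout from the earlier partitioning example:

-- create the partition for the coming week ...
CREATE TABLE applog_2016w35 (
    CHECK (ts >= '2016-08-29 00:00+00' AND ts < '2016-09-05 00:00+00')
) INHERITS (applog);

-- ... and drop the week that just fell out of the 6-week window
DROP TABLE applog_2016w29;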
On Mon, Aug 22, 2016 at 7:03 AM, Andy Colson <andy@squeakycode.net> wrote:
On 8/22/2016 2:39 AM, Thomas Güttler wrote:
On 19.08.2016 19:59, Andy Colson wrote:
On 8/19/2016 2:32 AM, Thomas Güttler wrote:
I want to store logs in a simple table.
Here my columns:
Primary-key (auto generated)
timestamp
host
service-on-host
loglevel
msg
json (optional)
I am unsure which DB to choose: Postgres, ElasticSearch or ...?
We don't have high traffic. About 200k rows per day.
My heart beats for postgres. We use it since several years.
On the other hand, the sentence "Don't store logs in a DB" is
somewhere in my head.....
What do you think?
I played with ElasticSearch a little, mostly because I wanted to use
Kibana which looks really pretty. I dumped a ton
of logs into it, and made a pretty dashboard ... but in the end it
didn't really help me, and wasn't that useful. My
problem is, I don't want to have to go look at it. If something goes
bad, then I want an email alert, at which point
I'm going to go run top, and tail the logs.
Another problem I had with kibana/ES is the syntax to search stuff is
different than I'm used to. It made it hard to
find stuff in kibana.
Right now, I have a perl script that reads apache logs and fires off
updates into PG to keep stats. But its an hourly
summary, which the website turns around and queries the stats to show
pretty usage graphs.
You use Perl to read apache logs. Does this work?
Forwarding logs reliably is not easy. Logs are streams, files in unix
are not streams. Sooner or later
the files get rotated. RELP exists, but AFAIK it's usage is not wide
spread:
https://en.wikipedia.org/wiki/Reliable_Event_Logging_Protocol
Let's see how to get the logs into postgres ....
In the end, PG or ES, all depends on what you want.
Most of my logs start from a http request. I want a unique id per request
in every log line which gets created. This way I can trace the request,
even if its impact spans to several hosts and systems which do not
receive http requests.
Regards,
Thomas Güttler
I don't read the file. In apache.conf:

# v, countyia, ip, sess, ts, url, query, status
LogFormat "3,%{countyName}e,%a,%{VCSID}C,%{%Y-%m-%dT%H:%M:%S%z}t,\"%U\",\"%q\",%>s" csv3
CustomLog "|/usr/local/bin/statSender.pl -r 127.0.0.1" csv3
I think I read somewhere that if you pipe to a script (like above) and you dont read fast enough, it could slow apache down. That's why the script above dumps do redis first. That way I can move processes around, restart the database, etc, etc, and not break apache in any way.
The important part of the script:
while (my $x = <>)
{
chomp($x);
next unless ($x);
try_again:
if ($redis)
{
eval {
$redis->lpush($qname, $x);
};
if ($@)
{
$redis = redis_connect();
goto try_again;
}
# just silence this one
eval {
$redis->ltrim($qname, 0, 1000);
};
}
}
Any other machine, or even multiple, then reads from redis and inserts into PG.
You can see, in my script, I trim the queue to 1000 items, but that's because I'm not as worried about loosing results. Your setup would probably be different. I also setup redis to not save anything to disk, again, because I don't mind if I loose a few hits here or there. But you get the idea.
-Andy
> You may decide not to use Elasticsearch but take a look at other components of Elastic Stack like logstash and beats.
> They can be helpful even when you use Postgres as the end point. Otherwise (IMHO), you would spend a lot of time writing
> scripts and jobs to capture and stream logs. If I were you, I would not want to do that.

I wanted to know what Beats is. I found this:

{{{
Beats is the platform for building lightweight, open source data shippers for many types of data you want to enrich with Logstash, search and analyze in Elasticsearch, and visualize in Kibana.
}}}

Oh what fun. How does "enrich data" work? Imagine you have N bytes. There are several ways to create N+M bytes from N bytes. I call this bloating, not enriching. You can analyze data, compress it, visualize it.... But AFAIK you can't enrich data.

Reading the next pages of the advertisement:

{{{
enriched and combined with other data sources using Logstash
}}}

OK, this way data can be enriched by using other data sources. Still I see no use case for Beats. Do you use it? How does it help you?

I guess I will take the boring route. I never used it, but it looks solid:
https://en.wikipedia.org/wiki/Reliable_Event_Logging_Protocol

Regards,
Thomas Güttler

--
Thomas Guettler http://www.thomas-guettler.de/
On 23.08.2016 22:42, Terry Schmitt wrote:
> Certainly Postgres is capable of handling this volume just fine. Throw in some partition rotation handling and you have
> a solution.
> If you want to play with something different, check out Graylog, which is backed by Elasticsearch. A bit more work to
> set up than a single Postgres table, but it has been a success for us storing syslog, app logs, and Postgres logs from
> several hundred network devices, Windows and Linux servers. Rotation is handled based on your requirements and drilling
> down to the details is trivial. Alerting is baked in as well. It could well be overkill for your needs, but I don't know
> what your environment looks like.

Thank you for this hint. I will look at it.

Regards,
Thomas Güttler

--
Thomas Guettler http://www.thomas-guettler.de/