Thread: PG vs ElasticSearch for Logs
I want to store logs in a simple table.

Here are my columns:

Primary-key (auto generated)
timestamp
host
service-on-host
loglevel
msg
json (optional)

I am unsure which DB to choose: Postgres, ElasticSearch or ...?

We don't have high traffic. About 200k rows per day.

My heart beats for Postgres. We have been using it for several years.

On the other hand, the sentence "Don't store logs in a DB" is somewhere in my head.....

What do you think?

--
Thomas Guettler http://www.thomas-guettler.de/
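For reference, a minimal sketch of such a table in Postgres; the table name, the types and the jsonb choice for the optional JSON column are assumptions, not a fixed design:

CREATE TABLE applog (
    id        bigserial PRIMARY KEY,              -- auto-generated primary key
    ts        timestamptz NOT NULL DEFAULT now(), -- timestamp
    host      text NOT NULL,
    service   text NOT NULL,                      -- service-on-host
    loglevel  text NOT NULL,
    msg       text NOT NULL,
    js        jsonb                               -- optional structured part
);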
On 8/19/2016 12:32 AM, Thomas Güttler wrote:
> What do you think?

I store most of my logs in flat textfiles syslog style, and use grep for adhoc querying.

200K rows/day, thats 1.4 million/week, 6 million/month, pretty soon you're talking big tables.

in fact thats several rows/second on a 24/7 basis

--
john r pierce, recycling bits in santa cruz
On 19.08.2016 09:42, John R Pierce wrote:
> On 8/19/2016 12:32 AM, Thomas Güttler wrote:
>> What do you think?
>
> I store most of my logs in flat textfiles syslog style, and use grep for adhoc querying.
>
> 200K rows/day, thats 1.4 million/week, 6 million/month, pretty soon you're talking big tables.
>
> in fact thats several rows/second on a 24/7 basis

There is no need to store them more than 6 weeks in my current use case.

I think indexing in Postgres is much faster than grep. And queries including json data are not possible with grep (or at least very hard to type).

My concern is which DB (or indexing) to use ...

Regards,
Thomas

--
Thomas Guettler http://www.thomas-guettler.de/
On Fri, Aug 19, 2016 at 4:58 PM Thomas Güttler <guettliml@thomas-guettler.de> wrote:
On 19.08.2016 09:42, John R Pierce wrote:
> On 8/19/2016 12:32 AM, Thomas Güttler wrote:
>> What do you think?
>
> I store most of my logs in flat textfiles syslog style, and use grep for adhoc querying.
>
> 200K rows/day, thats 1.4 million/week, 6 million/month, pretty soon you're talking big tables.
>
> in fact thats several rows/second on a 24/7 basis
There is no need to store them more then 6 weeks in my current use case.
I think indexing in postgres is much faster than grep.
And queries including json data are not possible with grep (or at least very hard to type)
My concern is which DB (or indexing) to use ...
How will you be using the logs? What kind of queries? What kind of searches?
Correlating events and logs from various sources could be really easy with joins, count and summary operations.
The kind of volume you are anticipating should be fine with Postgres, but before you really decide which one, you need to figure out what you want to do with this data once it is in Postgres.
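As an illustration of the summary side of this (a sketch only, reusing the hypothetical applog table from above):

-- events per host and loglevel over the last day
SELECT host, loglevel, count(*) AS events
FROM applog
WHERE ts >= now() - interval '1 day'
GROUP BY host, loglevel
ORDER BY host, loglevel;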
Regards,
Thomas
--
Thomas Guettler http://www.thomas-guettler.de/
--
Best Regards
Sameer Kumar | DB Solution Architect
ASHNIK PTE. LTD.
101 Cecil Street, #11-11 Tong Eng Building, Singapore 069 533
T: +65 6438 3504 | M: +65 8110 0350
Skype: sameer.ashnik | www.ashnik.com
On 19.08.2016 11:21, Sameer Kumar wrote:
> On Fri, Aug 19, 2016 at 4:58 PM Thomas Güttler <guettliml@thomas-guettler.de> wrote:
>
>> On 19.08.2016 09:42, John R Pierce wrote:
>>> On 8/19/2016 12:32 AM, Thomas Güttler wrote:
>>>> What do you think?
>>>
>>> I store most of my logs in flat textfiles syslog style, and use grep for adhoc querying.
>>>
>>> 200K rows/day, thats 1.4 million/week, 6 million/month, pretty soon you're talking big tables.
>>>
>>> in fact thats several rows/second on a 24/7 basis
>>
>> There is no need to store them more then 6 weeks in my current use case.
>>
>> I think indexing in postgres is much faster than grep.
>> And queries including json data are not possible with grep (or at least very hard to type)
>>
>> My concern is which DB (or indexing) to use ...
>
> How will you be using the logs? What kind of queries? What kind of searches?
> Correlating events and logs from various sources could be really easy with joins, count and summary operations.

Wishes grow with possibilities. First I want to do simple queries about hosts and timestamps. Then some simple substring matches.

Up to now no structured logging (the json column) gets created. But if it gets filled, we will find use cases where we use ssh+grep up to now.

Up to now we need no stemming and language support.

> The kind of volume you are anticipating should be fine with Postgres but before you really decide which one, you need to
> figure out what would you want to do with this data once it is in Postgres.

The goal is a bit fuzzy up to now: better overview.

Thank you for your feedback ("The kind of volume you are anticipating should be fine with Postgres").

I guess I will use Postgres, especially since the Django ORM supports JSON in Postgres:
https://docs.djangoproject.com/en/1.10/ref/contrib/postgres/fields/#jsonfield

Regards,
Thomas Güttler

--
Thomas Guettler http://www.thomas-guettler.de/
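To give an idea of what a query on the json column could look like once it gets filled, a sketch assuming the hypothetical jsonb column js and a made-up key name:

-- log lines whose JSON payload carries status = 'error' (key and value are made up)
SELECT ts, host, msg
FROM applog
WHERE js ->> 'status' = 'error'
  AND ts >= now() - interval '6 weeks';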
On 19.08.2016 at 10:57, Thomas Güttler wrote:
> On 19.08.2016 09:42, John R Pierce wrote:
[-------------]
>> in fact thats several rows/second on a 24/7 basis
>
> There is no need to store them more then 6 weeks in my current use case.
>
> I think indexing in postgres is much faster than grep.

Not so much IMHE(xperience):

1. if you know what you are looking for: grep's the best
2. if you don't .... grep (or more/less/vi) is the best

Only when you routinely update/join/etc does an RDBMS really shine. But that's not what you normally do with your logs, right?

But then again, there is an additional benefit of "having everything under one hood" - so standardising on a single repository (like an RDBMS) has its benefits.

regards,
-R
Thomas Güttler <guettliml@thomas-guettler.de> wrote:
>> How will you be using the logs? What kind of queries? What kind of searches?
>> Correlating events and logs from various sources could be really easy with joins, count and summary operations.
>
> Wishes raise with possibilities. First I want to do simple queries about
> hosts and timestamps. Then some simple substring matches.

For append-only tables like this, consider 9.5 and BRIN indexes for timestamp searches. But if you delete after N weeks, BRIN may not work properly because of vacuum and re-use of space within the table. Do you know BRIN?

So, in your case, consider partitioning, maybe per month. That way you can also avoid a mess with table and index bloat.

Greetings from Dresden to Chemnitz (is this still valid?)

Regards, Andreas Kretschmer

--
Andreas Kretschmer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
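For reference, creating such an index is a one-liner (a sketch using the hypothetical names from earlier); a BRIN index stays tiny because it only stores one summary entry per range of table blocks:

CREATE INDEX applog_ts_brin ON applog USING brin (ts);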
On Fri, Aug 19, 2016 at 12:44 PM, Andreas Kretschmer <akretschmer@spamfence.net> wrote:
> for append-only tables like this consider 9.5 and BRIN-Indexes for
> timestamp-searches. But if you deletes after N weeks BRIN shouldn't work
> properly because of vacuum and re-use of space within the table.
> Do you know BRIN?
>
> So, in your case, consider partitioning, maybe per month. So you can
> also avoid mess with table and index bloat.

If done properly he can use both. For 6 weeks I would use seven partitions, current + 6 previous, and drop the oldest partition weekly, so effectively they become append-only and he can use BRIN too.

Even better, if he normally inserts in batches (it happens to me with some log-like data: I rotate the file and insert all rotated data periodically), he can use a staging table (1 master, and inheriting from it seven constrained week partitions plus one unconstrained staging partition). Insertions go into staging and are moved with a small delay to the corresponding partition, using an ordered select so they go in perfect order into their final resting place, and it can be vacuumed just after that (if they are log lines and the maximum delay is X, you just move every row older than that from staging to the partition, with whatever period is best).

The staging partition is normally small and cached and can be processed quite fast (with 200k/day an hourly movement will leave staging with less than about 10k rows if the distribution is somehow uniform).

Francisco Olarte.
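A sketch of that layout with inheritance partitioning (declarative partitioning did not exist yet in 9.5/9.6); all table names, week boundaries and the settle delay are assumptions:

-- one constrained child per week, inheriting from the hypothetical applog parent
CREATE TABLE applog_2016w34 (
    CHECK (ts >= '2016-08-22 00:00+00' AND ts < '2016-08-29 00:00+00')
) INHERITS (applog);

-- unconstrained staging child that receives all inserts
CREATE TABLE applog_staging () INHERITS (applog);

-- periodically move the settled rows of one week from staging into its
-- partition, ordered by timestamp so the partition stays physically sorted
WITH moved AS (
    DELETE FROM applog_staging
    WHERE ts >= '2016-08-22 00:00+00' AND ts < '2016-08-29 00:00+00'
      AND ts < now() - interval '1 hour'
    RETURNING *
)
INSERT INTO applog_2016w34
SELECT * FROM moved
ORDER BY ts;

-- dropping the oldest week is instant and leaves no bloat behind
DROP TABLE applog_2016w28;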
On 19.08.2016 12:44, Andreas Kretschmer wrote:
> Thomas Güttler <guettliml@thomas-guettler.de> wrote:
>
>>> How will you be using the logs? What kind of queries? What kind of searches?
>>> Correlating events and logs from various sources could be really easy with joins, count and summary operations.
>>
>> Wishes raise with possibilities. First I want to do simple queries about
>> hosts and timestamps. Then some simple substring matches.
>
> for append-only tables like this consider 9.5 and BRIN-Indexes for
> timestamp-searches. But if you deletes after N weeks BRIN shouldn't work
> properly because of vacuum and re-use of space within the table.
> Do you know BRIN?
>
> So, in your case, consider partitioning, maybe per month. So you can
> also avoid mess with table and index bloat.

Thank you very much for these hints. I did not know BRIN before.

> Greetings from Dresden to Chemnitz (is this still valid?)

Yes, I am in Chemnitz/Germany. Everything (kids, wife, friends, sports, job) is fine. I hope the same with you?

Regards,
Thomas Güttler

--
Thomas Guettler http://www.thomas-guettler.de/
On 8/19/2016 3:44 AM, Andreas Kretschmer wrote:
> So, in your case, consider partitioning, maybe per month. So you can
> also avoid mess with table and index bloat.

with his 6 week retention, i'd partition by week.

--
john r pierce, recycling bits in santa cruz
On Fri, Aug 19, 2016 at 2:32 AM, Thomas Güttler <guettliml@thomas-guettler.de> wrote:
> I want to store logs in a simple table.
>
> Here my columns:
>
> Primary-key (auto generated)
> timestamp
> host
> service-on-host
> loglevel
> msg
> json (optional)
>
> I am unsure which DB to choose: Postgres, ElasticSearch or ...?

We use SOLR (which is similar to ElasticSearch) here for json document retrieval. Agreeing to do this was one of the biggest mistakes in my professional career. This choice was somewhat forced because at the time jsonb was not baked. In my opinion, jsonb outclasses these types of services, particularly if you are already invested in postgres. The specifics of your requirements also play into this decision, naturally.

The bottom line though is that these kinds of systems are not nearly as fast or robust as they claim to be, particularly if you wander off the use cases they are engineered for (like needing transactions or joins, for example). They also tend to be fairly opaque in how they operate, and the supporting tooling is laughable relative to established database systems. Postgres OTOH can be made to do pretty much anything given sufficient expertise and a progressive attitude.

merlin
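To make the jsonb point concrete, a sketch of the index that typically backs such document queries, assuming the hypothetical js column from earlier (jsonb_path_ops indexes only support the containment operator, but are small and fast for it):

CREATE INDEX applog_js_gin ON applog USING gin (js jsonb_path_ops);

-- containment query that can use the GIN index; the key/value pair is made up
SELECT ts, host, service, msg
FROM applog
WHERE js @> '{"component": "billing"}';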
On 8/19/2016 2:32 AM, Thomas Güttler wrote:
> I want to store logs in a simple table.
>
> Here my columns:
>
> Primary-key (auto generated)
> timestamp
> host
> service-on-host
> loglevel
> msg
> json (optional)
>
> I am unsure which DB to choose: Postgres, ElasticSearch or ...?
>
> We don't have high traffic. About 200k rows per day.
>
> My heart beats for postgres. We use it since several years.
>
> On the other hand, the sentence "Don't store logs in a DB" is
> somewhere in my head.....
>
> What do you think?

I played with ElasticSearch a little, mostly because I wanted to use Kibana, which looks really pretty. I dumped a ton of logs into it, and made a pretty dashboard ... but in the end it didn't really help me, and wasn't that useful. My problem is, I don't want to have to go look at it. If something goes bad, then I want an email alert, at which point I'm going to go run top, and tail the logs.

Another problem I had with kibana/ES is the syntax to search stuff is different than I'm used to. It made it hard to find stuff in kibana.

Right now, I have a perl script that reads apache logs and fires off updates into PG to keep stats. But it's an hourly summary, which the website turns around and queries to show pretty usage graphs.

In the end, PG or ES, all depends on what you want.

-Andy
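For what it's worth, that kind of hourly roll-up is something plain SQL does well; a sketch with an entirely made-up access_log table (ts timestamptz, status int):

SELECT date_trunc('hour', ts) AS hour,
       count(*) AS hits,
       count(*) FILTER (WHERE status >= 500) AS server_errors
FROM access_log
WHERE ts >= now() - interval '7 days'
GROUP BY 1
ORDER BY 1;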
On 19/08/16 10:57, Thomas Güttler wrote:
>>> What do you think?
>>
>> I store most of my logs in flat textfiles syslog style, and use grep for adhoc querying.
>>
>> 200K rows/day, thats 1.4 million/week, 6 million/month, pretty soon you're talking big tables.
>>
>> in fact thats several rows/second on a 24/7 basis
>
> There is no need to store them more then 6 weeks in my current use case.

Hi,

to me this kind of data looks like something Postgres can handle with ease.

We're talking about 8.4M rows here.

Coincidentally, I was trying out the new parallel query feature in the 9.6 beta just now and decided to use your numbers as a test case :)

I can create 8.4M records having a timestamp and a random ~ 250 character string in ~ 31 seconds:

pg96=# select now() + (sec / 200000.0 * 86400.0 || ' seconds')::interval as ts,
pg96-#        repeat(random()::text, 15) as msg
pg96-# into t1
pg96-# from generate_series(1, 6 * 7 * 200000) as sec;
SELECT 8400000
Time: 30858.274 ms

Table size is 2.4 GB. This gives about 6 weeks.

A query to scan the whole thing on the narrow column takes ~ 400 msec, like this:

pg96=# select min(ts), max(ts) from t1;
              min              |              max
-------------------------------+-------------------------------
 2016-08-19 20:17:24.921333+00 | 2016-09-30 20:17:24.489333+00
(1 row)

Time: 409.468 ms

Even running an unanchored regular expression (!) on the wider column is doable:

pg96=# select count(*) from t1 where msg ~ '12345';
 count
-------
   955
(1 row)

Time: 3146.838 ms

If you have some filter, not everything needs to be regexped and this gets pretty fast:

pg96=# select count(*) from t1 where ts between '2016-08-25' and '2016-08-26' and msg ~ '12345';
 count
-------
    24
(1 row)

Time: 391.577 ms

All this is without indices. Your data is more structured than my test, so undoubtedly you will get some gain from indices...

Here is something more analytical - basically the same as the count(*) above:

pg96=# select ts::date, count(*) from t1 where msg ~ '12345' group by ts::date order by ts::date;
     ts     | count
------------+-------
 2016-08-19 |    26
 2016-08-20 |    28
 [...]
 2016-09-28 |    21
 2016-09-29 |    33
(42 rows)

Time: 3157.010 ms

Note, however, that I'm using the 9.6 beta with the parallel query feature: the sequential scans with the regexp are run in parallel on 6 workers... this gives me a speed-up of a factor 4-5 (the machine has 8 logical CPUs) and the whole table fits in cache. For a use case like this, the parallel query feature in 9.6 is so good it's almost like cheating ;)

Bye,
Chris.
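If indices are wanted for exactly this workload, two obvious candidates (a sketch on top of the t1 test table above, not benchmarked here) are a BRIN index on the timestamp and, for the unanchored pattern searches, a trigram GIN index on the message column:

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX t1_ts_brin  ON t1 USING brin (ts);
CREATE INDEX t1_msg_trgm ON t1 USING gin (msg gin_trgm_ops);

-- the substring/regexp searches above can then use the trigram index
SELECT count(*) FROM t1 WHERE msg ~ '12345';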
On Sat, 20 Aug 2016, 2:00 a.m. Andy Colson, <andy@squeakycode.net> wrote:
On 8/19/2016 2:32 AM, Thomas Güttler wrote:
> I want to store logs in a simple table.
>
> Here my columns:
>
> Primary-key (auto generated)
> timestamp
> host
> service-on-host
> loglevel
> msg
> json (optional)
>
> I am unsure which DB to choose: Postgres, ElasticSearch or ...?
>
> We don't have high traffic. About 200k rows per day.
>
> My heart beats for postgres. We use it since several years.
>
> On the other hand, the sentence "Don't store logs in a DB" is
> somewhere in my head.....
>
> What do you think?
>
>
>
I played with ElasticSearch a little, mostly because I wanted to use
Kibana which looks really pretty. I dumped a ton of logs into it, and
made a pretty dashboard ... but in the end it didn't really help me, and
wasn't that useful. My problem is, I don't want to have to go look at
it. If something goes bad, then I want an email alert, at which point
I'm going to go run top, and tail the logs.
There are tools in the Elastic Stack which could have helped you with email alerts and with gathering top output or tailing log files.
Another problem I had with kibana/ES is the syntax to search stuff is
different than I'm used to. It made it hard to find stuff in kibana.
Right now, I have a perl script that reads apache logs and fires off
updates into PG to keep stats. But its an hourly summary, which the
website turns around and queries the stats to show pretty usage graphs.
In the end, PG or ES, all depends on what you want.
-Andy
--
Best Regards
Sameer Kumar | DB Solution Architect
ASHNIK PTE. LTD.
101 Cecil Street, #11-11 Tong Eng Building, Singapore 069 533
T: +65 6438 3504 | M: +65 8110 0350
Skype: sameer.ashnik | www.ashnik.com
Thank you Chris for looking at my issue in such detail. Yes, the parallel feature rocks.

Regards,
Thomas Güttler

On 19.08.2016 22:40, Chris Mair wrote:
> On 19/08/16 10:57, Thomas Güttler wrote:
>
>>>> What do you think?
>>>
>>> I store most of my logs in flat textfiles syslog style, and use grep for adhoc querying.
>>>
>>> 200K rows/day, thats 1.4 million/week, 6 million/month, pretty soon you're talking big tables.
>>>
>>> in fact thats several rows/second on a 24/7 basis
>>
>> There is no need to store them more then 6 weeks in my current use case.
>
> Hi,
>
> to me this kind of data looks like something Postgres can handle with ease.
>
> We're talking about 8.4M rows here.
>
> Coincidentally, I was trying out the new parallel query feature in the
> 9.6 beta just now and decided to use your numbers as a test case :)
>
> I can create 8.4M records having a timestamp and a random ~ 250 character string
> in ~ 31 seconds:
>
> pg96=# select now() + (sec / 200000.0 * 86400.0 || ' seconds')::interval as ts,
> pg96-#        repeat(random()::text, 15) as msg
> pg96-# into t1
> pg96-# from generate_series(1, 6 * 7 * 200000) as sec;
> SELECT 8400000
> Time: 30858.274 ms
>
> Table size is 2.4 GB. This gives about 6 weeks.
>
> A query to scan the whole thing on the narrow column takes ~ 400 msec, like this:
>
> pg96=# select min(ts), max(ts) from t1;
>               min              |              max
> -------------------------------+-------------------------------
>  2016-08-19 20:17:24.921333+00 | 2016-09-30 20:17:24.489333+00
> (1 row)
>
> Time: 409.468 ms
>
> Even running an unanchored regular expression (!) on the wider column is doable:
>
> pg96=# select count(*) from t1 where msg ~ '12345';
>  count
> -------
>    955
> (1 row)
>
> Time: 3146.838 ms
>
> If you have some filter, not everything needs to be regexped and this gets pretty fast:
>
> pg96=# select count(*) from t1 where ts between '2016-08-25' and '2016-08-26' and msg ~ '12345';
>  count
> -------
>     24
> (1 row)
>
> Time: 391.577 ms
>
> All this is without indices. Your data is more structured than my test, so undoubtly you will
> get some gain from indices...
>
> Here is something more analytical - basically same as the count(*) above:
>
> pg96=# select ts::date, count(*) from t1 where msg ~ '12345' group by ts::date order by ts::date;
>      ts     | count
> ------------+-------
>  2016-08-19 |    26
>  2016-08-20 |    28
>  [...]
>  2016-09-28 |    21
>  2016-09-29 |    33
> (42 rows)
>
> Time: 3157.010 ms
>
> Note, however, that I'm using 9.6 beta with the parallel query feature: the sequential scans with the regexp is
> run in parallel on 6 workers... this gives me a speed-up of a factor 4-5 (machine has 8 logical CPUs) and
> the whole table fits in cache. For a use case as this, the parallel query feature in 9.6 is so good it's almost
> like cheating ;)
>
> Bye,
> Chris.

--
Thomas Guettler http://www.thomas-guettler.de/
On 19.08.2016 19:59, Andy Colson wrote:
> On 8/19/2016 2:32 AM, Thomas Güttler wrote:
>> I want to store logs in a simple table.
>>
>> Here my columns:
>>
>> Primary-key (auto generated)
>> timestamp
>> host
>> service-on-host
>> loglevel
>> msg
>> json (optional)
>>
>> I am unsure which DB to choose: Postgres, ElasticSearch or ...?
>>
>> We don't have high traffic. About 200k rows per day.
>>
>> My heart beats for postgres. We use it since several years.
>>
>> On the other hand, the sentence "Don't store logs in a DB" is
>> somewhere in my head.....
>>
>> What do you think?
>
> I played with ElasticSearch a little, mostly because I wanted to use Kibana which looks really pretty. I dumped a ton
> of logs into it, and made a pretty dashboard ... but in the end it didn't really help me, and wasn't that useful. My
> problem is, I don't want to have to go look at it. If something goes bad, then I want an email alert, at which point
> I'm going to go run top, and tail the logs.
>
> Another problem I had with kibana/ES is the syntax to search stuff is different than I'm used to. It made it hard to
> find stuff in kibana.
>
> Right now, I have a perl script that reads apache logs and fires off updates into PG to keep stats. But its an hourly
> summary, which the website turns around and queries the stats to show pretty usage graphs.

You use Perl to read apache logs. Does this work?

Forwarding logs reliably is not easy. Logs are streams, files in unix are not streams. Sooner or later the files get rotated. RELP exists, but AFAIK its usage is not widespread:

https://en.wikipedia.org/wiki/Reliable_Event_Logging_Protocol

Let's see how to get the logs into postgres ....

> In the end, PG or ES, all depends on what you want.

Most of my logs start from a http request. I want a unique id per request in every log line which gets created. This way I can trace the request, even if its impact spans several hosts and systems which do not receive http requests.

Regards,
Thomas Güttler

--
Thomas Guettler http://www.thomas-guettler.de/
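A sketch of what that trace could look like on the Postgres side, assuming a hypothetical request_id column added to the hypothetical log table and filled in by the applications:

-- hypothetical column carrying the per-request id
ALTER TABLE applog ADD COLUMN request_id text;

-- all log lines produced by one HTTP request, across hosts and services
SELECT ts, host, service, loglevel, msg
FROM applog
WHERE request_id = 'a3f1c2d4e5'   -- made-up id value
ORDER BY ts;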
On Mon, 22 Aug 2016, 3:40 p.m. Thomas Güttler, <guettliml@thomas-guettler.de> wrote:
On 19.08.2016 19:59, Andy Colson wrote:
> On 8/19/2016 2:32 AM, Thomas Güttler wrote:
>> I want to store logs in a simple table.
>>
>> Here my columns:
>>
>> Primary-key (auto generated)
>> timestamp
>> host
>> service-on-host
>> loglevel
>> msg
>> json (optional)
>>
>> I am unsure which DB to choose: Postgres, ElasticSearch or ...?
>>
>> We don't have high traffic. About 200k rows per day.
>>
>> My heart beats for postgres. We use it since several years.
>>
>> On the other hand, the sentence "Don't store logs in a DB" is
>> somewhere in my head.....
>>
>> What do you think?
>>
>>
>>
>
> I played with ElasticSearch a little, mostly because I wanted to use Kibana which looks really pretty. I dumped a ton
> of logs into it, and made a pretty dashboard ... but in the end it didn't really help me, and wasn't that useful. My
> problem is, I don't want to have to go look at it. If something goes bad, then I want an email alert, at which point
> I'm going to go run top, and tail the logs.
>
> Another problem I had with kibana/ES is the syntax to search stuff is different than I'm used to. It made it hard to
> find stuff in kibana.
>
> Right now, I have a perl script that reads apache logs and fires off updates into PG to keep stats. But its an hourly
> summary, which the website turns around and queries the stats to show pretty usage graphs.
You use Perl to read apache logs. Does this work?
Forwarding logs reliably is not easy. Logs are streams, files in unix are not streams. Sooner or later
the files get rotated. RELP exists, but AFAIK it's usage is not wide spread:
https://en.wikipedia.org/wiki/Reliable_Event_Logging_Protocol
Let's see how to get the logs into postgres ....
> In the end, PG or ES, all depends on what you want.
Most of my logs start from a http request. I want a unique id per request
in every log line which gets created. This way I can trace the request,
even if its impact spans to several hosts and systems which do not receive http requests.
You may decide not to use Elasticsearch but take a look at other components of Elastic Stack like logstash and beats. They can be helpful even when you use Postgres as the end point. Otherwise (IMHO), you would spend a lot of time writing scripts and jobs to capture and stream logs. If I were you, I would not want to do that.
Regards,
Thomas Güttler
--
Thomas Guettler http://www.thomas-guettler.de/
--
Best Regards
Sameer Kumar | DB Solution Architect
ASHNIK PTE. LTD.
101 Cecil Street, #11-11 Tong Eng Building, Singapore 069 533
T: +65 6438 3504 | M: +65 8110 0350
Skype: sameer.ashnik | www.ashnik.com
On 8/22/2016 2:39 AM, Thomas Güttler wrote:
> On 19.08.2016 19:59, Andy Colson wrote:
>> On 8/19/2016 2:32 AM, Thomas Güttler wrote:
>>> I want to store logs in a simple table.
>>>
>>> Here my columns:
>>>
>>> Primary-key (auto generated)
>>> timestamp
>>> host
>>> service-on-host
>>> loglevel
>>> msg
>>> json (optional)
>>>
>>> I am unsure which DB to choose: Postgres, ElasticSearch or ...?
>>>
>>> We don't have high traffic. About 200k rows per day.
>>>
>>> My heart beats for postgres. We use it since several years.
>>>
>>> On the other hand, the sentence "Don't store logs in a DB" is
>>> somewhere in my head.....
>>>
>>> What do you think?
>>
>> I played with ElasticSearch a little, mostly because I wanted to use
>> Kibana which looks really pretty. I dumped a ton
>> of logs into it, and made a pretty dashboard ... but in the end it
>> didn't really help me, and wasn't that useful. My
>> problem is, I don't want to have to go look at it. If something goes
>> bad, then I want an email alert, at which point
>> I'm going to go run top, and tail the logs.
>>
>> Another problem I had with kibana/ES is the syntax to search stuff is
>> different than I'm used to. It made it hard to
>> find stuff in kibana.
>>
>> Right now, I have a perl script that reads apache logs and fires off
>> updates into PG to keep stats. But its an hourly
>> summary, which the website turns around and queries the stats to show
>> pretty usage graphs.
>
> You use Perl to read apache logs. Does this work?
>
> Forwarding logs reliably is not easy. Logs are streams, files in unix
> are not streams. Sooner or later
> the files get rotated. RELP exists, but AFAIK it's usage is not wide
> spread:
>
> https://en.wikipedia.org/wiki/Reliable_Event_Logging_Protocol
>
> Let's see how to get the logs into postgres ....
>
>> In the end, PG or ES, all depends on what you want.
>
> Most of my logs start from a http request. I want a unique id per request
> in every log line which gets created. This way I can trace the request,
> even if its impact spans to several hosts and systems which do not
> receive http requests.
>
> Regards,
> Thomas Güttler

I don't read the file. In apache.conf:

# v, countyia, ip, sess, ts, url, query, status
LogFormat "3,%{countyName}e,%a,%{VCSID}C,%{%Y-%m-%dT%H:%M:%S%z}t,\"%U\",\"%q\",%>s" csv3

CustomLog "|/usr/local/bin/statSender.pl -r 127.0.0.1" csv3

I think I read somewhere that if you pipe to a script (like above) and you don't read fast enough, it could slow apache down. That's why the script above dumps to redis first. That way I can move processes around, restart the database, etc, etc, and not break apache in any way.

The important part of the script:

while (my $x = <>)
{
    chomp($x);
    next unless ($x);

try_again:
    if ($redis)
    {
        eval {
            $redis->lpush($qname, $x);
        };
        if ($@)
        {
            $redis = redis_connect();
            goto try_again;
        }
        # just silence this one
        eval {
            $redis->ltrim($qname, 0, 1000);
        };
    }
}

Any other machine, or even multiple, then reads from redis and inserts into PG.

You can see, in my script, I trim the queue to 1000 items, but that's because I'm not as worried about losing results. Your setup would probably be different. I also setup redis to not save anything to disk, again, because I don't mind if I lose a few hits here or there. But you get the idea.

-Andy
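On the Postgres side, the consumer that drains the queue could batch its inserts; a minimal sketch assuming the hypothetical applog table from earlier (COPY would also work for larger batches):

INSERT INTO applog (ts, host, service, loglevel, msg)
VALUES
    ('2016-08-22 07:03:01+00', 'web1', 'apache', 'INFO', 'GET /index.html 200'),
    ('2016-08-22 07:03:02+00', 'web1', 'apache', 'INFO', 'GET /favicon.ico 404');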
Certainly Postgres is capable of handling this volume just fine. Throw in some partition rotation handling and you have a solution.
If you want to play with something different, check out Graylog, which is backed by Elasticsearch. A bit more work to set up than a single Postgres table, but it has been a success for us storing syslog, app logs, and Postgres logs from several hundred network devices, Windows and Linux servers. Rotation is handled based on your requirements and drilling down to the details is trivial. Alerting is baked in as well. It could well be overkill for your needs, but I don't know what your environment looks like.
T
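The partition rotation mentioned above can be as little as two statements per week; a sketch reusing the hypothetical inheritance layout from the earlier partitioning example:

-- create the partition for the coming week ...
CREATE TABLE applog_2016w35 (
    CHECK (ts >= '2016-08-29 00:00+00' AND ts < '2016-09-05 00:00+00')
) INHERITS (applog);

-- ... and drop the week that just fell out of the 6-week window
DROP TABLE applog_2016w29;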
On Mon, Aug 22, 2016 at 7:03 AM, Andy Colson <andy@squeakycode.net> wrote:
On 8/22/2016 2:39 AM, Thomas Güttler wrote:
On 19.08.2016 19:59, Andy Colson wrote:
On 8/19/2016 2:32 AM, Thomas Güttler wrote:
I want to store logs in a simple table.
Here my columns:
Primary-key (auto generated)
timestamp
host
service-on-host
loglevel
msg
json (optional)
I am unsure which DB to choose: Postgres, ElasticSearch or ...?
We don't have high traffic. About 200k rows per day.
My heart beats for postgres. We use it since several years.
On the other hand, the sentence "Don't store logs in a DB" is
somewhere in my head.....
What do you think?
I played with ElasticSearch a little, mostly because I wanted to use
Kibana which looks really pretty. I dumped a ton
of logs into it, and made a pretty dashboard ... but in the end it
didn't really help me, and wasn't that useful. My
problem is, I don't want to have to go look at it. If something goes
bad, then I want an email alert, at which point
I'm going to go run top, and tail the logs.
Another problem I had with kibana/ES is the syntax to search stuff is
different than I'm used to. It made it hard to
find stuff in kibana.
Right now, I have a perl script that reads apache logs and fires off
updates into PG to keep stats. But its an hourly
summary, which the website turns around and queries the stats to show
pretty usage graphs.
You use Perl to read apache logs. Does this work?
Forwarding logs reliably is not easy. Logs are streams, files in unix
are not streams. Sooner or later
the files get rotated. RELP exists, but AFAIK it's usage is not wide
spread:
https://en.wikipedia.org/wiki/Reliable_Event_Logging_Protocol
Let's see how to get the logs into postgres ....
In the end, PG or ES, all depends on what you want.
Most of my logs start from a http request. I want a unique id per request
in every log line which gets created. This way I can trace the request,
even if its impact spans to several hosts and systems which do not
receive http requests.
Regards,
Thomas Güttler
I don't read the file. In apache.conf:

# v, countyia, ip, sess, ts, url, query, status
LogFormat "3,%{countyName}e,%a,%{VCSID}C,%{%Y-%m-%dT%H:%M:%S%z}t,\"%U\",\"%q\",%>s" csv3
CustomLog "|/usr/local/bin/statSender.pl -r 127.0.0.1" csv3
I think I read somewhere that if you pipe to a script (like above) and you dont read fast enough, it could slow apache down. That's why the script above dumps do redis first. That way I can move processes around, restart the database, etc, etc, and not break apache in any way.
The important part of the script:
while (my $x = <>)
{
chomp($x);
next unless ($x);
try_again:
if ($redis)
{
eval {
$redis->lpush($qname, $x);
};
if ($@)
{
$redis = redis_connect();
goto try_again;
}
# just silence this one
eval {
$redis->ltrim($qname, 0, 1000);
};
}
}
Any other machine, or even multiple, then reads from redis and inserts into PG.
You can see, in my script, I trim the queue to 1000 items, but that's because I'm not as worried about loosing results. Your setup would probably be different. I also setup redis to not save anything to disk, again, because I don't mind if I loose a few hits here or there. But you get the idea.
-Andy
> You may decide not to use Elasticsearch but take a look at other components of Elastic Stack like logstash and beats.
> They can be helpful even when you use Postgres as the end point. Otherwise (IMHO), you would spend a lot of time writing
> scripts and jobs to capture and stream logs. If I were you, I would not want to do that.

I wanted to know what Beats is. I found this:

{{{
Beats is the platform for building lightweight, open source data shippers for many types of data you want to enrich with Logstash, search and analyze in Elasticsearch, and visualize in Kibana.
}}}

Oh what fun. How does "enrich data" work? Imagine you have N bytes. There are several ways to create N+M bytes from N bytes. I call this bloating, not enriching. You can analyze data, compress it, visualize it.... But AFAIK you can't enrich data.

Reading the next pages of the advertisement:

{{{
enriched and combined with other data sources using Logstash
}}}

OK, this way data can be enriched by using other data sources. Still I see no use case for Beats. Do you use it? How does it help you?

I guess I will take the boring route. I never used it, but it looks solid:
https://en.wikipedia.org/wiki/Reliable_Event_Logging_Protocol

Regards,
Thomas Güttler

--
Thomas Guettler http://www.thomas-guettler.de/
On 23.08.2016 22:42, Terry Schmitt wrote:
> Certainly Postgres is capable of handling this volume just fine. Throw in some partition rotation handling and you have
> a solution.
> If you want to play with something different, check out Graylog, which is backed by Elasticsearch. A bit more work to
> set up than a single Postgres table, but it has been a success for us storing syslog, app logs, and Postgres logs from
> several hundred network devices, Windows and Linux servers. Rotation is handled based on your requirements and drilling
> down to the details is trivial. Alerting is baked in as well. It could well be overkill for your needs, but I don't know
> what your environment looks like.

Thank you for this hint. I will look at it.

Regards,
Thomas Güttler

--
Thomas Guettler http://www.thomas-guettler.de/