Re: PG vs ElasticSearch for Logs - Mailing list pgsql-general

From Thomas Güttler
Subject Re: PG vs ElasticSearch for Logs
Date
Msg-id 6ff8364d-4854-ab98-087a-45c618d142be@thomas-guettler.de
In response to Re: PG vs ElasticSearch for Logs  (Chris Mair <chris@1006.org>)
List pgsql-general

Thank you, Chris, for looking at my issue in such detail.

Yes, the parallel feature rocks.

Regards,
   Thomas Güttler

On 19.08.2016 at 22:40, Chris Mair wrote:
> On 19/08/16 10:57, Thomas Güttler wrote:
>
>
>>>> What do you think?
>>>
>>> I store most of my logs in flat text files, syslog style, and use grep for ad-hoc querying.
>>>
>>>  200K rows/day, that's 1.4 million/week, 6 million/month; pretty soon you're talking big tables.
>>>
>>> in fact that's a couple of rows per second (200K / 86,400 s ≈ 2.3) on a 24/7 basis
>>
>> There is no need to store them for more than 6 weeks in my current use case.
>
>
> Hi,
>
> to me this kind of data looks like something Postgres can handle with ease.
>
> We're talking about 8.4M rows here (6 weeks × 7 days × 200K rows/day).
>
> Coincidentally, I was trying out the new parallel query feature in the
> 9.6 beta just now and decided to use your numbers as a test case :)
>
> I can create 8.4M records, each with a timestamp and a random ~250-character string,
> in ~ 31 seconds:
>
> pg96=# select now() + (sec / 200000.0 * 86400.0 || ' seconds')::interval as ts,
> pg96-#        repeat(random()::text, 15) as msg
> pg96-# into t1
> pg96-# from generate_series(1, 6 * 7 * 200000) as sec;
> SELECT 8400000
> Time: 30858.274 ms
>
> Table size is 2.4 GB.
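>
> As a side note, one way to measure this with built-in functions:
>
> select pg_size_pretty(pg_table_size('t1'));  -- table plus TOAST, without indexes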
>
> This spans about 6 weeks of data. A query that scans the whole table on the narrow column
> takes ~ 400 ms, like this:
>
> pg96=# select min(ts), max(ts) from t1;
>               min              |              max
> -------------------------------+-------------------------------
>  2016-08-19 20:17:24.921333+00 | 2016-09-30 20:17:24.489333+00
> (1 row)
>
> Time: 409.468 ms
>
> Even running an unanchored regular expression (!) on the wider column is doable:
>
> pg96=# select count(*) from t1 where msg ~ '12345';
>  count
> -------
>    955
> (1 row)
>
> Time: 3146.838 ms
>
> If you have some filter, not everything needs to be regexped, and this gets pretty fast:
>
> pg96=# select count(*) from t1 where ts between '2016-08-25' and '2016-08-26' and msg ~ '12345';
>  count
> -------
>     24
> (1 row)
>
> Time: 391.577 ms
>
> All this is without indices. Your data is more structured than my test, so undoubtedly you will
> get some gain from indices...
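>
> For instance, a sketch of indexes that might help (the index names are made up for
> illustration; pg_trgm is a contrib extension whose trigram GIN operator class can
> also serve unanchored regexp matches):
>
> create extension if not exists pg_trgm;
>
> -- btree on the timestamp speeds up the range filters shown above
> create index t1_ts_idx on t1 (ts);
>
> -- trigram GIN index the planner can use for msg ~ '12345'
> create index t1_msg_trgm_idx on t1 using gin (msg gin_trgm_ops);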
>
> Here is something more analytical, basically the same as the count(*) above:
>
> pg96=# select ts::date, count(*) from t1 where msg ~ '12345' group by ts::date order by ts::date;
>      ts     | count
> ------------+-------
>  2016-08-19 |    26
>  2016-08-20 |    28
>  [...]
>  2016-09-28 |    21
>  2016-09-29 |    33
> (42 rows)
>
> Time: 3157.010 ms
>
> Note, however, that I'm using the 9.6 beta with the parallel query feature: the sequential scans with the regexp
> are run in parallel across 6 workers... this gives me a speed-up of a factor of 4-5 (the machine has 8 logical
> CPUs), and the whole table fits in cache. For a use case like this, the parallel query feature in 9.6 is so good
> it's almost like cheating ;)
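>
> For reference, a sketch of how to reproduce that; the setting is called
> max_parallel_workers_per_gather in 9.6 (earlier betas had it as max_parallel_degree),
> and setting it to 0 disables parallelism:
>
> set max_parallel_workers_per_gather = 6;
> explain (analyze) select count(*) from t1 where msg ~ '12345';
> -- the plan should show a Gather node with parallel workers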
>
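> As for the six-week retention mentioned at the top, a minimal sketch (assuming a
> hypothetical logs table with the same ts column):
>
> -- run periodically, e.g. from cron; with time-based partitioning,
> -- dropping an old partition would be cheaper at larger volumes
> delete from logs where ts < now() - interval '6 weeks';
>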
> Bye,
> Chris.

--
Thomas Guettler http://www.thomas-guettler.de/

