Thread: [PERFORM] Store/Retrieve time series data from PostgreSQL

[PERFORM] Store/Retrieve time series data from PostgreSQL

From
Subramaniam C
Date:
Hi

Requirement :- 
We need to retrieve latest health of around 1.5 million objects for a given time.

Implementation :-
We are storing hourly data of each object in single row. Given below is the schema :-
     
CREATE TABLE health_timeseries (
       mobid text NOT NULL,      hour bigint NOT NULL,      health real[]    );

mobId - Object ID
hour - Epoch hour
health - Array of health values for a given hour of that object.

Each object has 2 hours of health data (i.e. 2 rows for each object) so total no. of rows is around 3 million.

With the above approach the query to retrieve the latest health of all objects for a given time duration is taking around    85 seconds. I have tried to increase the work_mem, effective_cache, shared_buffer to 4 GB of PostgreSQL but still there was no improvement in the query execution time.

Query :-
select distinct on (health_timeseries.mobid) mobid, health_timeseries.health, health_timeseries.hour from health_timeseries where hour >=(1505211054000/(3600*1000))-1 and hour <= 1505211054000/(3600*1000) ORDER BY health_timeseries.mobid DESC, health_timeseries.hour DESC;


Hardware Configuration of PostgreSQL VM :-
1. OS - Centos.
2. Postgresql version - 9.6.2
3. RAM - 16 GB RAM
4. CPU - 8 vCPU

Please let us know the hardware configuration of PostgreSQL for such huge dataset?

And also let us know if there is any better schema/query to retrieve this data?

Thanks and Regards
Subramaniam
    
      


Re: [PERFORM] Store/Retrieve time series data from PostgreSQL

From
vinny
Date:
On 2017-09-14 13:51, Subramaniam C wrote:
> Hi
> 
> QUERY :-
> 
> _select distinct on (health_timeseries.mobid) mobid,
> health_timeseries.health, health_timeseries.hour from
> health_timeseries where hour >=(1505211054000/(3600*1000))-1 and hour
> <= 1505211054000/(3600*1000) ORDER BY health_timeseries.mobid DESC,
> health_timeseries.hour DESC;_
> 

Did you run EXPLAIN on this query to see what it is actually doing?

What you are doing how is selecting all rows from the last hour,
sorting them by mobid and hour, and then DISTINCT filters out al 
duplicates.

Sorting on mobid is therefor useless, DISTINCT still has to check all 
rows.

Sorting on mobid and hour will take a long time if there is no index for 
it,
so if you don't have an index on the mobid and hour together then you 
should probably try that.


But, see what EXPLAIN tells you first.

Regards,
Vincent.


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] Store/Retrieve time series data from PostgreSQL

From
Subramaniam C
Date:
I created index on morbid and hour together. Given below is the EXPLAIN output

------------------------------------------------------------------------------------------

 Unique  (cost=606127.16..621098.42 rows=1087028 width=200)

   ->  Sort  (cost=606127.16..613612.79 rows=2994252 width=200)

         Sort Key: mobid DESC, hour DESC

         ->  Seq Scan on health_timeseries  (cost=0.00..284039.00 rows=2994252 width=200)

               Filter: ((hour >= '418134'::bigint) AND (hour <= '418135'::bigint))


On Thu, Sep 14, 2017 at 5:33 PM, vinny <vinny@xs4all.nl> wrote:
On 2017-09-14 13:51, Subramaniam C wrote:
Hi

QUERY :-

_select distinct on (health_timeseries.mobid) mobid,
health_timeseries.health, health_timeseries.hour from
health_timeseries where hour >=(1505211054000/(3600*1000))-1 and hour
<= 1505211054000/(3600*1000) ORDER BY health_timeseries.mobid DESC,
health_timeseries.hour DESC;_


Did you run EXPLAIN on this query to see what it is actually doing?

What you are doing how is selecting all rows from the last hour,
sorting them by mobid and hour, and then DISTINCT filters out al duplicates.

Sorting on mobid is therefor useless, DISTINCT still has to check all rows.

Sorting on mobid and hour will take a long time if there is no index for it,
so if you don't have an index on the mobid and hour together then you should probably try that.


But, see what EXPLAIN tells you first.

Regards,
Vincent.

Re: [PERFORM] Store/Retrieve time series data from PostgreSQL

From
Subramaniam C
Date:
With this query I am trying to get the latest hour for a given timestamp so that I can get whole health array of all object for a given hour. So I am doing DISTINCT on mobid and order by hour and mobid DESC.

On Thu, Sep 14, 2017 at 6:03 PM, Subramaniam C <subramaniam31784@gmail.com> wrote:
I created index on morbid and hour together. Given below is the EXPLAIN output

------------------------------------------------------------------------------------------

 Unique  (cost=606127.16..621098.42 rows=1087028 width=200)

   ->  Sort  (cost=606127.16..613612.79 rows=2994252 width=200)

         Sort Key: mobid DESC, hour DESC

         ->  Seq Scan on health_timeseries  (cost=0.00..284039.00 rows=2994252 width=200)

               Filter: ((hour >= '418134'::bigint) AND (hour <= '418135'::bigint))


On Thu, Sep 14, 2017 at 5:33 PM, vinny <vinny@xs4all.nl> wrote:
On 2017-09-14 13:51, Subramaniam C wrote:
Hi

QUERY :-

_select distinct on (health_timeseries.mobid) mobid,
health_timeseries.health, health_timeseries.hour from
health_timeseries where hour >=(1505211054000/(3600*1000))-1 and hour
<= 1505211054000/(3600*1000) ORDER BY health_timeseries.mobid DESC,
health_timeseries.hour DESC;_


Did you run EXPLAIN on this query to see what it is actually doing?

What you are doing how is selecting all rows from the last hour,
sorting them by mobid and hour, and then DISTINCT filters out al duplicates.

Sorting on mobid is therefor useless, DISTINCT still has to check all rows.

Sorting on mobid and hour will take a long time if there is no index for it,
so if you don't have an index on the mobid and hour together then you should probably try that.


But, see what EXPLAIN tells you first.

Regards,
Vincent.