Thread: [GENERAL] Large data and slow queries
Hi. We have 400,000,000 records in a table (soon to be 800,000,000); here is the schema (\d+): https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121

We want the following kinds of query to be fast:

SELECT DISTINCT "user_event"."user_id" FROM "user_event" WHERE "user_event"."what" IN ('poll', 'location_change', 'midnight_location') AND ("user_event"."created_at" >= '2016-04-19 01:23:55') AND (latitude > -37.03079375089291 AND latitude < -36.67086424910709 AND longitude > 174.6307139779924 AND longitude < 175.0805140220076);

We have a btree index and it appears to be working. However, it's still pretty slow. EXPLAIN ANALYZE gives the following: https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121#gistcomment-2065314

I'm thinking that I need to do the following to help:

CLUSTER user_event ON index_user_event_for_visits_3, followed by ANALYZE... Our data is mostly time series, but sometimes we get some dumps with historical records.

Perhaps add a BRIN index on created_at.

I'm wondering if we can use an index to cache all user_ids seen on a given day. If we change our index to be more discrete, e.g. created_at::date, would this help? The set union of user_ids for 365 days should be pretty fast?

I'm open to any ideas or suggestions; ideally we can keep optimisations within the database rather than adding a layer of caching on top.

Kind regards,
Samuel
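For reference, the two maintenance ideas above written out as SQL (a sketch; the index name is the one mentioned in the post, and CLUSTER takes an ACCESS EXCLUSIVE lock while it rewrites the table):

    -- BRIN index on created_at: very small, and effective while the table is
    -- mostly in insert (time) order.
    CREATE INDEX user_event_created_at_brin ON user_event USING brin (created_at);

    -- Physically reorder the table by the existing btree index, then refresh
    -- planner statistics.
    CLUSTER user_event USING index_user_event_for_visits_3;
    ANALYZE user_event;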
Oh, I've also tried earth distance and ll_to_earth in a GIST index... it was slower than the BTREE index on a small subset of data in my tests.
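For readers following along, the earthdistance experiment mentioned here would look roughly like the following; this is a sketch of the documented usage, not the poster's actual test:

    -- earthdistance builds on the cube extension.
    CREATE EXTENSION IF NOT EXISTS cube;
    CREATE EXTENSION IF NOT EXISTS earthdistance;

    -- Expression index over the existing latitude/longitude columns.
    CREATE INDEX user_event_ll_to_earth_gist
        ON user_event USING gist (ll_to_earth(latitude, longitude));

    -- Events within ~20 km of a point (radius in metres). earth_box() gives a
    -- bounding box the GiST index can use; earth_distance() trims the corners.
    SELECT user_id
    FROM user_event
    WHERE earth_box(ll_to_earth(-36.85, 174.76), 20000)
          @> ll_to_earth(latitude, longitude)
      AND earth_distance(ll_to_earth(-36.85, 174.76),
                         ll_to_earth(latitude, longitude)) < 20000;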
On 4/18/2017 9:01 PM, Samuel Williams wrote:
> We want the following kinds of query to be fast:
>
> SELECT ... AND (latitude > -37.03079375089291 AND latitude < -36.67086424910709 AND longitude > 174.6307139779924 AND longitude < 175.0805140220076);

I wonder if GIST would work better if you use the native POINT type, and compared it like

    mypoint <@ BOX '((174.6307139779924,-37.03079375089291),(175.0805140220076,-36.67086424910709))'

with a gist index on mypoint... but it all hinges on which clauses in your query are most selective; that's where you want an index.

--
john r pierce, recycling bits in santa cruz
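A sketch of what John describes, assuming a separately maintained point column (the "coordinates" column name is made up for illustration, and backfilling it on 400M rows is a heavy one-off operation):

    -- point(x, y) takes longitude first, matching the box literal below.
    ALTER TABLE user_event ADD COLUMN coordinates point;
    UPDATE user_event SET coordinates = point(longitude, latitude);

    CREATE INDEX user_event_coordinates_gist
        ON user_event USING gist (coordinates);

    -- The bounding-box test from the original query, expressed with <@.
    SELECT DISTINCT user_id
    FROM user_event
    WHERE coordinates <@ box '((174.6307139779924,-37.03079375089291),(175.0805140220076,-36.67086424910709))';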
Thanks John. Yes, you are absolutely right: you want the index to be bottom heavy so you can cull as much as possible at the top. I'm familiar with that from once implementing a brute-force sudoku solver; it works on the same principle.

I've been working on this all afternoon. By reducing the longitude and latitude columns to float4, I found about a 50% improvement in performance in my test cases. It may also use less space. So part of the problem was my choice of data type. We've computed that float4 has a worst-case precision of about 1.6m, which we are okay with for analytics data.

Another option we may consider is using a (signed) integer - e.g. longitude = 180*(v/2^31) and latitude = 180*(v/2^31) - as this has a uniform error across all points, but it's a bit more cumbersome to handle. Is there a rational datatype in postgres which works like this?
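For the record, the float4 change itself is a single (table-rewriting) ALTER; the scaled-integer alternative is sketched after it, with the 2^31 scaling purely illustrative:

    -- Shrink the coordinate columns to single precision (4 bytes each).
    ALTER TABLE user_event
      ALTER COLUMN latitude  TYPE float4,
      ALTER COLUMN longitude TYPE float4;

    -- Scaled-integer encoding as described above: v = round(longitude / 180 * 2^31),
    -- decoded again as 180.0 * v / 2147483648.0. Shown here only as an expression.
    SELECT round(longitude / 180.0 * 2147483648)::bigint AS longitude_scaled,
           latitude, longitude
    FROM user_event
    LIMIT 10;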
On 2017-04-19 07:04, Samuel Williams wrote:
> By reducing the longitude and latitude columns to float4, I found about a
> 50% improvement in performance in my test cases. It may also use less space.

Did that 50% performance gain come from just the datatype, or from the fact that the index became smaller?

Given the number of records, my first thought was either partitioning or partial indexes. The fewer rows are in the index, the quicker it will be to check, and it's not a lot of work to create separate indexes for lat/long ranges or dates.
On 4/19/2017 12:31 AM, vinny wrote:
> Given the number of records, my first thought was either partitioning or
> partial indexes. The fewer rows are in the index, the quicker it will be
> to check, and it's not a lot of work to create separate indexes for
> lat/long ranges or dates.

That only works if the planner can figure out which partitions to use in advance; otherwise it ends up having to scan all the partitions.

--
john r pierce, recycling bits in santa cruz
> Did that 50% performance gain come from just the datatype, or from the fact
> that the index became smaller?

How would one measure this?
> On 19 Apr 2017, at 6:01, Samuel Williams <space.ship.traveller@gmail.com> wrote:
>
> We have a btree index and it appears to be working. However, it's
> still pretty slow.

It seems to me that a large part of the problem is that the server has to scan all index entries from that date and within those location bounds to find that the distinct user id's in that set are about 114,000 out of 1.7M rows matching the selection criteria. If it could stop at the first location for each user, it would have to scan less than a 10th of the index entries that it has to scan now...

How high is the correlation between user id and location? That can probably be leveraged…

Perhaps you'd get better performance if you'd use multiple indices instead of stuffing everything into a single purpose-specific one? I would suggest one on (user_id, latitude, longitude) or (latitude, longitude, user_id) and one on (created_at, user_id), or perhaps (created_at, latitude, longitude). That approach could also reduce the number of indices you have on that table, as well as their sizes, making it all fit into cache a little easier. Then again, additional operations will be required to combine them.

For a different approach: it may be possible to enrich your data with something that is easy to index and query, with a high correlation to (latitude, longitude). That could also be used to partition over. Country seems a good candidate here, unless all your data-points are in New Zealand like the above? Then again, some countries are a lot larger, with a higher population, than others. And populations can be highly concentrated (Delhi, Moscow, to name a few).

Another option is to divide the location space up into squares of a fixed size, with a partition for each square. About 80% of those squares are unpopulated though, being at sea.

Created_at is a very good candidate for partitioning too, especially if you don't intend to keep data older than a certain age. Truncating or dropping a partition that you no longer need is quick and easy.

With data-sets this large, I'd think you would want to partition on multiple dimensions, creating a matrix of partitions under a single master table. I don't think PG has a simple way of doing that (yet) though; perhaps it's possible by abusing multiple levels of inheritance, but that sounds like a bad idea.

And of course, create your partitions sufficiently coarse to prevent overburdening the system tables, which would slow down the query planner.

Hopefully there's something useful in my ramblings!
Alban Hertroys -- If you can't see the forest for the trees, cut the trees and you'll find there is no forest.
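In concrete terms, the two-index variant suggested above is just plain DDL (the index names here are made up):

    CREATE INDEX user_event_created_at_user_id_idx
        ON user_event (created_at, user_id);

    CREATE INDEX user_event_lat_lon_user_id_idx
        ON user_event (latitude, longitude, user_id);

When it estimates that to be cheaper, the planner can combine the two at query time with bitmap index scans joined by a BitmapAnd, which is what the later replies in this thread refer to.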
Thanks Alban, I appreciate your ideas and thoughts.

I'm a little reluctant to go down the partitioning route, as I think we'll probably end up sharding out horizontally using Citus Data in the near future, and naive Postgres partitioning may hamper that effort.

It's interesting you talk about using multiple indexes. In MySQL/MariaDB and derivatives, I've never seen the query planner consider using multiple indexes. So, it's possible that Postgres may use multiple indexes if it saves time? Or do you mean doing something like manually joining the data and leveraging the different indexes explicitly?

The correlation between user_id and location... well, it's somewhat temporally related.
On 2017-04-19 09:48, John R Pierce wrote:
> That only works if the planner can figure out which partitions to use in
> advance; otherwise it ends up having to scan all the partitions.

True, but then again, the proposed queries are quite straightforward, so I don't expect that to be a problem, really. Worth a test, if only to see if it helps.
Samuel, others,

Perhaps I'm missing something, but I'd be interested in the reasoning behind this.

For column 'what', it seems you have no index on all values, only indices with specific values for 'what'.

How does this speed up the search? Will PostgreSQL use those indices, instead of using a generic index on 'what' and optionally other columns?

With regards,

Martijn Tonies
Upscene Productions
http://www.upscene.com
Martijn that is a good question. It's because we are only concerned with a subset of events for this index and this particular query. The query planner can recognise this and use the index correctly. By doing this, we reduce the size of the index significantly. In the best case, where we only wanted a few things, the index was reduced from 21GB to 8MB.
On 2017-04-19 13:25, Martijn Tonies (Upscene Productions) wrote:
> For column 'what', it seems you have no index on all values, only
> indices with specific values for 'what'.
>
> How does this speed up the search? Will PostgreSQL use those indices,
> instead of using a generic index on 'what' and optionally other columns?

That's a "partial index": it only contains records that meet the requirements of the index definition.

https://www.postgresql.org/docs/9.5/static/indexes-partial.html

Basically, if you create an index on records where 'name = kees', then if your query contains "where name = kees" the planner can just load that index and know that the records in it will not contain any other names, saving the need to filter for 'name = kees'.
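Applied to this thread's table, such a partial index would look something like the following (the name and column list are illustrative; the real definitions are in the schema gist). The query's WHERE clause has to imply the index predicate for the planner to use it:

    CREATE INDEX index_user_event_for_visits_example
        ON user_event (created_at, latitude, longitude, user_id)
        WHERE what IN ('poll', 'location_change', 'midnight_location');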
Hello Vinny, Samuel,

Thanks for the answers. I've seen such indices before, but used, for example, on an UPPER(...) of all values; I never thought of using them to filter out specific values in order to make the index smaller.

With regards,

Martijn Tonies
Upscene Productions
http://www.upscene.com
Database Workbench - developer tool for Oracle, MS SQL Server, PostgreSQL,
SQL Anywhere, MySQL, InterBase, NexusDB and Firebird.
Greetings,

* Samuel Williams (space.ship.traveller@gmail.com) wrote:
> We want the following kinds of query to be fast:

"kinds of query" isn't helpful, you should be reviewing exactly the queries you care about because statistics and your exact data set and what the exact query you're running is will all have an impact.

> SELECT DISTINCT "user_event"."user_id" FROM "user_event" WHERE
> "user_event"."what" IN ('poll', 'location_change', 'midnight_location')
> AND ("user_event"."created_at" >= '2016-04-19 01:23:55') AND
> (latitude > -37.03079375089291 AND latitude < -36.67086424910709 AND
> longitude > 174.6307139779924 AND longitude < 175.0805140220076);

Ugh. You should really look at and consider PostGIS whenever you're working with geospatial data.

> We have a btree index and it appears to be working. However, it's
> still pretty slow.

You're finding some 17M records and then reducing that with DISTINCT to only 114k. Hard to say if it's faster to just brute-force your way through that with a HashAgg (as your explain analyze shows), or if a loose index scan would work better (check out how to do one in PG here: https://wiki.postgresql.org/wiki/Loose_indexscan).

> EXPLAIN ANALYZE gives the following:
> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121#gistcomment-2065314

That explain analyze shows a whole ton of heap fetches. When was the last time a VACUUM was run on this table, to build the visibility map? Without the visibility map being current, an Index-Only Scan, as is happening here, can really suck.

> CLUSTER user_event ON index_user_event_for_visits_3 followed by
> analyze... Our data is mostly time series but sometimes we get some
> dumps with historical records.
>
> Perhaps add a BRIN index on created_at

BRIN would be helpful if you wanted a smaller index. That doesn't seem to be the case here.

> I'm wondering if we can use an index to cache all user_ids seen on
> a given day. If we change our index to be more discrete, e.g.
> created_at::date, would this help? The set union of user_ids for 365
> days should be pretty fast?

Materialized views are often useful, particularly when the results are (relatively) slow moving.

> I'm open to any ideas or suggestions; ideally we can keep
> optimisations within the database rather than adding a layer of
> caching on top.

If you need to go the materialized view route, I'd definitely recommend doing that in the database rather than outside it.

Thanks!

Stephen
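The loose index scan pattern from that wiki page, applied to user_id, looks like this (a sketch only; the date, location and "what" filters would still need to be pushed into both subqueries and backed by a suitable index):

    WITH RECURSIVE distinct_users AS (
        (SELECT user_id FROM user_event ORDER BY user_id LIMIT 1)
      UNION ALL
        -- Each step jumps straight to the next distinct user_id via the index
        -- instead of visiting every matching row.
        SELECT (SELECT user_id
                FROM user_event
                WHERE user_id > d.user_id
                ORDER BY user_id
                LIMIT 1)
        FROM distinct_users d
        WHERE d.user_id IS NOT NULL
    )
    SELECT user_id FROM distinct_users WHERE user_id IS NOT NULL;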
Thanks for all the suggestions Stephen.

> That explain analyze shows a whole ton of heap fetches. When was the
> last time a VACUUM was run on this table, to build the visibility map?
> Without the visibility map being current, an Index-Only Scan, as is
> happening here, can really suck.

I'm using the default 9.6 config; I thought that auto-vacuum was on by default?
Greetings,

* Samuel Williams (space.ship.traveller@gmail.com) wrote:
> I'm using the default 9.6 config; I thought that auto-vacuum was on by default?

Sure, but that doesn't mean the table gets vacuumed. In particular, actual vacuums are only kicked off when the number of *updated* or *deleted* tuples passes the autovacuum threshold. If no updates or deletes are done on the table (iow, it's essentially an insert-only or insert-mostly table), the autovacuum daemon never runs a vacuum on it (though it'll run analyzes).

https://www.postgresql.org/docs/current/static/runtime-config-autovacuum.html#GUC-AUTOVACUUM-VACUUM-THRESHOLD

Check pg_stat_all_tables to see when/if the table has actually been vacuumed.

Thanks!

Stephen
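Concretely, the check and the manual fix look like this:

    -- Has autovacuum (or anyone) ever vacuumed the table?
    SELECT relname, last_vacuum, last_autovacuum, n_live_tup, n_dead_tup
    FROM pg_stat_all_tables
    WHERE relname = 'user_event';

    -- For an insert-only table, run it by hand to build the visibility map,
    -- so index-only scans stop falling back to heap fetches.
    VACUUM (VERBOSE) user_event;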
Ah right, yeah, it's insert only. So, it's never been vacuumed.
> On 19 Apr 2017, at 12:58, Samuel Williams <space.ship.traveller@gmail.com> wrote:
>
> It's interesting you talk about using multiple indexes. In
> MySQL/MariaDB and derivatives, I've never seen the query planner
> consider using multiple indexes. So, it's possible that Postgres may
> use multiple indexes if it saves time? Or do you mean doing something
> like manually joining the data and leveraging the different indexes
> explicitly?

PG is capable of doing bitmap heap scans to combine results from multiple indices, among other things. Whether that will actually improve performance in this case I don't know, but it's worth a try I think.

> The correlation between user_id and location... well, it's somewhat
> temporally related.

So users are constantly moving around but happen to be at the same locations at regular intervals?

In my experience, people don't usually move around much, so you should certainly be able to pinpoint them mostly to a specific area, right? (Hence my suggestions for a country column or partitioning in squares.)

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.
Okay, so after changing longitude/latitude to float4, and re-organizing the table a bit, I got the query down from about 8 minutes to 40 seconds. The details are in the gist comments: https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121 Now, just need to get performance another 2 orders of magnitude better. Can we make the index a bit more coarse grained, perhaps convert long/lat to integers or something, use a hilbert index, use a postgis index... ideas?
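Picking up the PostGIS suggestion from earlier in the thread, a minimal version of that route might look like this (the "geom" column name is made up, and the backfill UPDATE rewrites most of the table):

    CREATE EXTENSION IF NOT EXISTS postgis;

    ALTER TABLE user_event ADD COLUMN geom geometry(Point, 4326);
    UPDATE user_event
       SET geom = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326);

    CREATE INDEX user_event_geom_gist ON user_event USING gist (geom);

    -- && is an index-backed bounding-box test, equivalent to the latitude/
    -- longitude range predicates in the original query.
    SELECT DISTINCT user_id
    FROM user_event
    WHERE geom && ST_MakeEnvelope(174.6307139779924, -37.03079375089291,
                                  175.0805140220076, -36.67086424910709, 4326)
      AND what IN ('poll', 'location_change', 'midnight_location')
      AND created_at >= '2016-04-19 01:23:55';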
I'm curious why you have so many partial indexes. Are you trying to make custom indexes per query? It seems to me you might want to consider making the indexes general, and remove the redundant ones (that have the same prefix list of indexed fields).
Secondly, your table is 102 GB. Clearly there's a lot of data here. How many rows does that take? I would further suggest that you partition this table such that there are no more than about 10 million rows per partition (I've done this by using an id % 100 computation). Maybe in your case it makes sense to partition it based on the "what" field, because it appears you are trying to do that with your partial indexes already.
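On 9.6 the "what"-based partitioning Vick mentions would have to be inheritance-based (declarative partitioning only arrives in PostgreSQL 10). A rough sketch, with the insert-routing trigger, child indexes, and the migration of existing rows all omitted:

    -- Child tables carve the parent up by "what"; with constraint_exclusion
    -- enabled, the planner can skip children whose CHECK constraint
    -- contradicts the query's filter on "what".
    CREATE TABLE user_event_poll
        (CHECK (what = 'poll')) INHERITS (user_event);
    CREATE TABLE user_event_location_change
        (CHECK (what = 'location_change')) INHERITS (user_event);
    CREATE TABLE user_event_midnight_location
        (CHECK (what = 'midnight_location')) INHERITS (user_event);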
Awesome thread.

Samuel,

Just wanted you to be aware of the work we're doing at TimescaleDB (http://www.timescale.com/), a time-series database extension for PostgreSQL.

Some of how we might help you:
- automatic partitioning by space (primary key - like country_id, for instance) and time. This creates "chunks" of your data, right-sized by volume and time constraints (which makes inserts fast at much greater scale than vanilla PostgreSQL - not sure if this is a problem for you)
- this will also help if your queries are selective on time and country_id (per this example)
- the partitioning by time allows you to DROP old chunks without the need for vacuums

--
TimescaleDB | Growth & Developer Evangelism
c: 908.581.9509

335 Madison Ave.
New York, NY 10017
www.timescale.com
https://github.com/timescale/timescaledb
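For context, adopting TimescaleDB amounts to loading the extension and converting the table into a "hypertable"; the call below is the extension's documented entry point, shown only as a rough sketch (the table generally has to be empty, or its data migrated, at conversion time):

    CREATE EXTENSION IF NOT EXISTS timescaledb;

    -- Chunk user_event on the created_at time dimension.
    SELECT create_hypertable('user_event', 'created_at');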
On Thu, Apr 20, 2017 at 6:30 AM, Vick Khera <vivek@khera.org> wrote:
> Maybe in your case it makes sense to partition it based on the "what"
> field, because it appears you are trying to do that with your partial
> indexes already.

I would think a two-field index might be just as effective and not require a lot of maintenance etc.

--
To understand recursion, one must first understand recursion.
Andrew, how would timescaledb compare to citus - and is timescaledb an extension to postgres or is it an entirely separate system?
Scott, Vick, the vast majority of the data is generic, but there are some specific events we need to look up quickly - probably fewer than a few hundred thousand records.

We did evaluate partial indexes vs full indexes. The partial index speeds up our specific queries significantly while only taking a very small amount of space (often < 100MB, compared to a full index on the DB which is up around 20GB at the moment).
Samuel,

Short answer to your questions: (1) TimescaleDB and CitusDB are focusing on solving different problems, and (2) TimescaleDB is an Apache 2-licensed extension to run in your Postgres database, not a fork or different system.

Longer answer to your first question:

From what we've read and the conversations we've had with Citus (awesome folks, btw), they're primarily solving a different problem -- OLAP queries and distributed transactions -- while we are focused on time-series data analysis. As such, we haven't benchmarked against Citus and if we were to, it would require some unrealistic finagling in order to make it an apples-to-apples comparison.

As an example, their partitioning is based on a single primary key, while all writes in their clustered version go to a cluster master (although their hosted version of Citus MX changes this a bit). We perform two-dimensional partitioning using your primary key and time, with the partitions automatically created/closed on time intervals and data volume in order to keep the most recent partitioned chunks of data (and their B-trees) in memory to support fast ingest for time-series data. (***) We have only released our single-node version, but the clustering will allow you to write directly to any node. Citus has also decided only to support the SQL queries they are optimized for, while we've made the decision to support "full SQL" (even if some queries might not be fast).

*** We blogged yesterday about TimescaleDB's partitioning design choices in more depth, if you are interested:
https://blog.timescale.com/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c

On Thu, Apr 20, 2017 at 5:43 PM, Samuel Williams <space.ship.traveller@gmail.com> wrote:
> Andrew, how would timescaledb compare to citus - and is timescaledb an
> extension to postgres or is it an entirely separate system?

--
TimescaleDB | Growth & Developer Evangelism
c: 908.581.9509

335 Madison Ave.
New York, NY 10017
www.timescale.com
https://github.com/timescale/timescaledb
Citus is also now just an extension. Sounds pretty awesome, I'll certainly consider your system if/when we decide to make changes.