Thread: performance problems with bulk inserts/updates on tsrange with gist-based exclude constrains

Hi All,

I have quite a few tables that follow a pattern like this:

          Table "public.my_model"
   Column |       Type        | Modifiers
--------+-------------------+-----------
   period | tsrange           | not null
   key    | character varying | not null
   value  | integer           |
Indexes:
      "my_model_pkey" PRIMARY KEY, btree (period, key)
      "my_model_period_key_excl" EXCLUDE USING gist (period WITH &&, key
WITH =)
Check constraints:
      "my_model_period_check" CHECK (period <> 'empty'::tsrange)

So, a primary key of a period column and one or more other columns
(usually int or string) and an exclude constraint to prevent overlaps,
and a check constraint to prevent empty ranges.

However, I'm hitting performance problems on moderate bulk inserts and
updates, with ~700k rows taking around 13 minutes. Profiling my python
code suggests that most of the time is being taken by Postgres (9.4 in
this case...)

What can I do to speed things up? Is there a different type of index I
can use to achieve the same exclude constraint? Is there something I can
do to have the index changes only done on the commit of the bulk batches?

cheers,

Chris


On 9/16/2016 2:01 AM, Chris Withers wrote:
> Hi All,
>
> I have quite a few tables that follow a pattern like this:
>
>          Table "public.my_model"
>   Column |       Type        | Modifiers
> --------+-------------------+-----------
>   period | tsrange           | not null
>   key    | character varying | not null
>   value  | integer           |
> Indexes:
>      "my_model_pkey" PRIMARY KEY, btree (period, key)
>      "my_model_period_key_excl" EXCLUDE USING gist (period WITH &&,
> key WITH =)
> Check constraints:
>      "my_model_period_check" CHECK (period <> 'empty'::tsrange)
>
> So, a primary key of a period column and one or more other columns
> (usually int or string) and an exclude constraint to prevent overlaps,
> and a check constraint to prevent empty ranges.
>
> However, I'm hitting performance problems on moderate bulk inserts and
> updates, with ~700k rows taking around 13 minutes. Profiling my python
> code suggests that most of the time is being taken by Postgres (9.4 in
> this case...)
>
> What can I do to speed things up? Is there a different type of index I
> can use to achieve the same exclude constraint? Is there something I
> can do to have the index changes only done on the commit of the bulk
> batches?

if (period,key) is unique, by virtue of being the primary key, then
whats the point of the exclusion ??

I'm curious, how fast do your insert/updates run if you remove the key
exclusion and check constraint ?      tsvector operations are a lot more
complicated than simple matches in indexing....




--
john r pierce, recycling bits in santa cruz



On 9/16/2016 2:12 AM, John R Pierce wrote:
>   Column |       Type        | Modifiers
> --------+-------------------+-----------
>   period | tsrange           | not null

wait, what is a tsrange?   the standard textsearch data types in
postgres are tsvector and tsquery,


--
john r pierce, recycling bits in santa cruz



On 9/16/2016 2:23 AM, John R Pierce wrote:
>
> wait, what is a tsrange?   the standard textsearch data types in
> postgres are tsvector and tsquery,

never mind,  I should have known, its a timestamp range.   ...


when you do updates, are you changing any of the indexed fields, or just
"value" ?




On 16/09/2016 10:26, John R Pierce wrote:
> On 9/16/2016 2:23 AM, John R Pierce wrote:
>>
>> wait, what is a tsrange?   the standard textsearch data types in
>> postgres are tsvector and tsquery,
>
> never mind,  I should have known, its a timestamp range.   ...
>
>
> when you do updates, are you changing any of the indexed fields, or
> just "value" ?
Yeah, it's a temporal table, so "updates" involve modifying the period
column for a row to set its end ts, and then inserting a new row with a
start ts running on from that.

Of course, the adds are just inserting new rows.

cheers,

Chris


On 9/16/2016 3:46 AM, Chris Withers wrote:

when you do updates, are you changing any of the indexed fields, or
just "value" ?
Yeah, it's a temporal table, so "updates" involve modifying the period column for a row to set its end ts, and then inserting a new row with a start ts running on from that.

thats expensive, as it has to reindex that row.   and range indexes are more expensive than timestamp indexes

modifiyng the primary key is kind of a violation of one of the basic rules of relational databases as it means the row can't be referenced by another table.

I expect the expensive one is the constraint that ensures no periods overlap for the given key.    I'm not sure how that can be done short of a full scan for each update/insert.   it might actually perform better if you write the index with the key first as presumably the key is invariant ?



-- 
john r pierce, recycling bits in santa cruz
On 16/09/2016 12:00, John R Pierce wrote:
> On 9/16/2016 3:46 AM, Chris Withers wrote:
>>>
>>> when you do updates, are you changing any of the indexed fields, or
>>> just "value" ?
>> Yeah, it's a temporal table, so "updates" involve modifying the period
>> column for a row to set its end ts, and then inserting a new row with
>> a start ts running on from that.
>
> thats expensive, as it has to reindex that row.   and range indexes are
> more expensive than timestamp indexes
>
> modifiyng the primary key is kind of a violation of one of the basic
> rules of relational databases as it means the row can't be referenced by
> another table.

Right, but these rows have no natural primary key. Would it help if I
just added an auto-incrementing integer key? Would that make a positive
difference or would it just be a wasted column?

> I expect the expensive one is the constraint that ensures no periods
> overlap for the given key.    I'm not sure how that can be done short of
> a full scan for each update/insert.

Indeed, I wonder if making the constraint deferrable might help for the
bulk case?

> it might actually perform better
> if you write the index with the key first as presumably the key is
> invariant ?

You mean:

PRIMARY KEY, btree (key1, key2, period)

as opposed to

PRIMARY KEY, btree (period, key)

Interesting, I'd assumed postgres would optimise that under the covers...

Chris


-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Chris Withers
Sent: Friday, September 16, 2016 6:47 AM
To: John R Pierce <pierce@hogranch.com>; pgsql-general@postgresql.org
Subject: Re: [GENERAL] performance problems with bulk inserts/updates on tsrange with gist-based exclude constrains

On 16/09/2016 10:26, John R Pierce wrote:
> On 9/16/2016 2:23 AM, John R Pierce wrote:
>>
>> wait, what is a tsrange?   the standard textsearch data types in
>> postgres are tsvector and tsquery,
>
> never mind,  I should have known, its a timestamp range.   ...
>
>
> when you do updates, are you changing any of the indexed fields, or 
> just "value" ?
Yeah, it's a temporal table, so "updates" involve modifying the period column for a row to set its end ts, and then
insertinga new row with a start ts running on from that.
 

Of course, the adds are just inserting new rows.

cheers,

Chris

____________________________________________________________________________________________________

So, what is the value for "end ts", when the record is inserted (the range just started)?

Regards,
Igor Neyman

On 16/09/2016 14:54, Igor Neyman wrote:
>
> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Chris Withers
> Sent: Friday, September 16, 2016 6:47 AM
> To: John R Pierce <pierce@hogranch.com>; pgsql-general@postgresql.org
> Subject: Re: [GENERAL] performance problems with bulk inserts/updates on tsrange with gist-based exclude constrains
>
> On 16/09/2016 10:26, John R Pierce wrote:
>> On 9/16/2016 2:23 AM, John R Pierce wrote:
>>>
>>> wait, what is a tsrange?   the standard textsearch data types in
>>> postgres are tsvector and tsquery,
>>
>> never mind,  I should have known, its a timestamp range.   ...
>>
>>
>> when you do updates, are you changing any of the indexed fields, or
>> just "value" ?
> Yeah, it's a temporal table, so "updates" involve modifying the period column for a row to set its end ts, and then
insertinga new row with a start ts running on from that. 
>
> Of course, the adds are just inserting new rows.
>
> So, what is the value for "end ts", when the record is inserted (the range just started)?

It's open ended, so the period is [start_ts, )

Chris


Chris Withers <chris@simplistix.co.uk> writes:
> On 16/09/2016 14:54, Igor Neyman wrote:
>> So, what is the value for "end ts", when the record is inserted (the range just started)?

> It's open ended, so the period is [start_ts, )

I've not looked at the GiST range opclass, but I would not be surprised if
having lots of those is pretty destructive to the index's ability to be
selective about && searches.

            regards, tom lane


On 16/09/2016 12:00, John R Pierce wrote:
> On 9/16/2016 3:46 AM, Chris Withers wrote:
>>>
>>> when you do updates, are you changing any of the indexed fields, or
>>> just "value" ?
>> Yeah, it's a temporal table, so "updates" involve modifying the period
>> column for a row to set its end ts, and then inserting a new row with
>> a start ts running on from that.
>
> thats expensive, as it has to reindex that row.   and range indexes are
> more expensive than timestamp indexes
>
> modifiyng the primary key is kind of a violation of one of the basic
> rules of relational databases as it means the row can't be referenced by
> another table.

Right, but these rows have no natural primary key. Would it help if I
just added an auto-incrementing integer key? Would that make a positive
difference or would it just be a wasted column?

> I expect the expensive one is the constraint that ensures no periods
> overlap for the given key.    I'm not sure how that can be done short of
> a full scan for each update/insert.

Indeed, I wonder if making the constraint deferrable might help for the
bulk case?

> it might actually perform better
> if you write the index with the key first as presumably the key is
> invariant ?

You mean:

PRIMARY KEY, btree (period, key) as opposed to
>
>
>
> --
> john r pierce, recycling bits in santa cruz
>


On 16/09/2016 10:26, John R Pierce wrote:
> On 9/16/2016 2:23 AM, John R Pierce wrote:
>>
>> wait, what is a tsrange?   the standard textsearch data types in
>> postgres are tsvector and tsquery,
>
> never mind,  I should have known, its a timestamp range.   ...
>
>
> when you do updates, are you changing any of the indexed fields, or
> just "value" ?
Yeah, it's a temporal table, so "updates" involve modifying the period
column for a row to set its end ts, and then inserting a new row with a
start ts running on from that.

Of course, the adds are just inserting new rows.

cheers,

Chris


On 16/09/2016 15:29, Tom Lane wrote:
> Chris Withers <chris@simplistix.co.uk> writes:
>> On 16/09/2016 14:54, Igor Neyman wrote:
>>> So, what is the value for "end ts", when the record is inserted (the range just started)?
>
>> It's open ended, so the period is [start_ts, )
>
> I've not looked at the GiST range opclass, but I would not be surprised if
> having lots of those is pretty destructive to the index's ability to be
> selective about && searches.

If that's so, that's a little disappointing...
(I'd have thought the special case end value (open ended) and the ending
type (inclusive/exclusive) would just be sentinel values)

How would I verify your suspicions?

cheers,

Chris


On Fri, Sep 16, 2016 at 2:01 AM, Chris Withers <chris@simplistix.co.uk> wrote:
Hi All,

I have quite a few tables that follow a pattern like this:

         Table "public.my_model"
  Column |       Type        | Modifiers
--------+-------------------+-----------
  period | tsrange           | not null
  key    | character varying | not null
  value  | integer           |
Indexes:
     "my_model_pkey" PRIMARY KEY, btree (period, key)
     "my_model_period_key_excl" EXCLUDE USING gist (period WITH &&, key WITH =)
Check constraints:
     "my_model_period_check" CHECK (period <> 'empty'::tsrange)

Try swapping the order of the columns in the exclude constraint.  You want the more selective criterion to appear first in the index/constraint.  Presumably "key with =" is the most selective, especially if many of your periods are unbounded.

Cheers,

Jeff
Jeff Janes wrote
> Try swapping the order of the columns in the exclude constraint.  You want
> the more selective criterion to appear first in the index/constraint.
> Presumably "key with =" is the most selective, especially if many of your
> periods are unbounded.

I would not be so sure with that:
http://use-the-index-luke.com/sql/myth-directory/most-selective-first




--
View this message in context:
http://postgresql.nabble.com/performance-problems-with-bulk-inserts-updates-on-tsrange-with-gist-based-exclude-constrains-tp5921498p5922219.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


On Wed, Sep 21, 2016 at 2:18 PM, pinker <pinker@onet.eu> wrote:
Jeff Janes wrote
> Try swapping the order of the columns in the exclude constraint.  You want
> the more selective criterion to appear first in the index/constraint.
> Presumably "key with =" is the most selective, especially if many of your
> periods are unbounded.

I would not be so sure with that:

As a rule, I generally don't spout random nonsense. Or at least, not without including a disclaimer.  I didn't test it on the OPs exact case, because he has need blessed us with his data or his scripts.  But I have tested it on other data, and it does work.
 
http://use-the-index-luke.com/sql/myth-directory/most-selective-first

I don't see how anything there applies to GiST indexes.  Indeed, there doesn't seem to be much there worth reading at all.  The only thing vaguely informative, other than trivia about other RDBMS, is "It’s useless to have the most selective column of the index on the left if very few queries filter on it", which is rather obvious, but also obviously does not apply to this case.

Cheers,

Jeff
Hey Tom,

I appreciate you're busy, but did you ever get a chance to look at this?

On 19/09/2016 08:40, Chris Withers wrote:
> On 16/09/2016 15:29, Tom Lane wrote:
>> Chris Withers <chris@simplistix.co.uk> writes:
>>> On 16/09/2016 14:54, Igor Neyman wrote:
>>>> So, what is the value for "end ts", when the record is inserted (the
>>>> range just started)?
>>
>>> It's open ended, so the period is [start_ts, )
>>
>> I've not looked at the GiST range opclass, but I would not be
>> surprised if
>> having lots of those is pretty destructive to the index's ability to be
>> selective about && searches.
>
> If that's so, that's a little disappointing...
> (I'd have thought the special case end value (open ended) and the ending
> type (inclusive/exclusive) would just be sentinel values)
>
> How would I verify your suspicions?

cheers,

Chris


Hey Tom,

I appreciate you're busy, but did you ever get a chance to look at this?

On 19/09/2016 08:40, Chris Withers wrote:
> On 16/09/2016 15:29, Tom Lane wrote:
>> Chris Withers <chris@simplistix.co.uk> writes:
>>> On 16/09/2016 14:54, Igor Neyman wrote:
>>>> So, what is the value for "end ts", when the record is inserted (the
>>>> range just started)?
>>
>>> It's open ended, so the period is [start_ts, )
>>
>> I've not looked at the GiST range opclass, but I would not be
>> surprised if
>> having lots of those is pretty destructive to the index's ability to be
>> selective about && searches.
>
> If that's so, that's a little disappointing...
> (I'd have thought the special case end value (open ended) and the ending
> type (inclusive/exclusive) would just be sentinel values)
>
> How would I verify your suspicions?

cheers,

Chris