Thread: Trigger Performance

Trigger Performance

From: Randall Smith
Hi,

I've created a trigger that checks the uniqueness of two columns in a
table.  Traditionally, one would use a unique constraint, but in my
case, the size of the unique index would be too large and some
performance loss is acceptable.  However, the trigger performance seems
to be far below what's indicated by an explain analyze of the query used
in the trigger.

The unique fields consist of an indexed int8 (volume_id) and a text
field (name).  The average ratio of volume_id to name is 1 to 10,000.
The query I'm using to check uniqueness in the trigger is:

        ...
        IF (SELECT EXISTS (SELECT 1 FROM t1 WHERE
        volume_id = NEW.volume_id AND name = NEW.name)) THEN
            RAISE EXCEPTION '% already exists on volume', NEW.name;
        END IF;
        ...

This trigger is called only BEFORE INSERTS.
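
(For reference, a minimal self-contained version of the trigger looks
roughly like this; the function and trigger names are illustrative,
not the ones I actually use.)

    CREATE OR REPLACE FUNCTION check_name_unique() RETURNS trigger AS $$
    BEGIN
        IF (SELECT EXISTS (SELECT 1 FROM t1 WHERE
            volume_id = NEW.volume_id AND name = NEW.name)) THEN
            RAISE EXCEPTION '% already exists on volume', NEW.name;
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER t1_check_name_unique BEFORE INSERT ON t1
        FOR EACH ROW EXECUTE PROCEDURE check_name_unique();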

In testing, without the trigger, inserts are at 10,000 every 2 seconds.
With the trigger, they are 10,000 every 40 seconds.  The output of
explain analyze suggests that this shouldn't be taking so long.

    EXPLAIN ANALYZE SELECT EXISTS (
        SELECT 1 FROM t1 WHERE volume_id = 300 AND name = 'whodat');

 Result  (cost=8.35..8.36 rows=1 width=0) (actual time=0.019..0.020
rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Index Scan using volume_id_idx on t1  (cost=0.00..8.35 rows=1
width=0) (actual time=0.016..0.016 rows=0 loops=1)
           Index Cond: (volume_id = 300)
           Filter: (name = 'whodat'::text)
 Total runtime: 0.053 ms
(6 rows)

0.053 ms/record / 1,000 ms/sec * 10,000 records = .53 seconds

According to that stat, this lookup should be adding about 0.5 seconds
to 10,000 records, far from the actual 38 seconds it is adding.  I've
tried changing up the query in the trigger to see if I could get
different results, without much luck.  Any idea what might be taking up
the extra time, or what I can do to troubleshoot?

Thanks.

-Randall






Re: Trigger Performance

From: Alban Hertroys
On 15 Jan 2011, at 23:52, Randall Smith wrote:

> Hi,
>
> I've created a trigger that checks the uniqueness of two columns in a
> table.  Traditionally, one would use a unique constraint, but in my
> case, the size of the unique index would be too large and some
> performance loss is acceptable.

But you already do have an index on that id-field, so what's the problem
with using a unique constraint?  Its unique index could just replace the
existing one.

Alban Hertroys

--
Screwing up is an excellent way to attach something to the ceiling.





Re: Trigger Performance

From: Jasen Betts
On 2011-01-15, Randall Smith <randall@tnr.cc> wrote:
> Hi,
>
> I've created a trigger that checks the uniqueness of two columns in a
> table.  Traditionally, one would use a unique constraint, but in my
> case, the size of the unique index would be too large and some
> performance loss is acceptable.  However, the trigger performance seems
> to be far below what's indicated by an explain analyze of the query used
> in the trigger.
>
> The unique fields consist of an indexed int8 (volume_id) and a text
> field (name).  The average ratio of volume_id to name is 1 to 10,000.
> The query I'm using to check uniqueness in the trigger is:
>
>         ...
>         IF (SELECT EXISTS (SELECT 1 FROM t1 WHERE
>         volume_id = NEW.volume_id AND name = NEW.name)) THEN
>             RAISE EXCEPTION '% already exists on volume', NEW.name;
>         END IF;
>         ...

In plpgsql IF is an implicit select.
<http://www.postgresql.org/docs/8.4/interactive/plpgsql-expressions.html>

         IF EXISTS (SELECT 1 FROM t1 WHERE
         volume_id = NEW.volume_id AND name = NEW.name) THEN
             RAISE EXCEPTION '% already exists on volume', NEW.name;
         END IF;

I don't know if that will help performance, though.

>  Result  (cost=8.35..8.36 rows=1 width=0) (actual time=0.019..0.020
> rows=1 loops=1)
>    InitPlan 1 (returns $0)
>      ->  Index Scan using volume_id_idx on t1  (cost=0.00..8.35 rows=1
> width=0) (actual time=0.016..0.016 rows=0 loops=1)
>            Index Cond: (volume_id = 300)
>            Filter: (name = 'whodat'::text)
>  Total runtime: 0.053 ms
> (6 rows)

I got 0.4 ms the first time I tried this.

That actual runtime is helped by cache locality; reconnect and try it
again and you'll see a worse figure (I got a factor of three
difference).

At 53 µs it probably didn't hit the hard disk; when that starts
happening, things will get much worse.

> 0.053 ms/record / 1,000 ms/sec * 10,000 records = .53 seconds

huh?

> According to that stat, this lookup should be adding about 0.5 seconds
> to 10,000 records,

Why?  What are you doing to 10,000 records?

> far from the actual 38 seconds it is adding.  I've
> tried to change up the query in the trigger to see if I could get
> different results with not much luck.  Any idea what might be taking up
> the extra time or what I can do to troubleshoot?

Try it without that check (on sample data) and see how much faster it
goes (just comment out that part of the trigger).  When I tested that
here I could not see a definite difference (timings were all over the
place, some slower, some faster).

How much disk (in bytes, and dollars) are you hoping to save by not
using the index?

--
⚂⚃ 100% natural

Re: Trigger Performance

From: Randall Smith
Before reading: this is solved.  It was an error on my part.

On Sun, 2011-01-16 at 03:46 +0000, Jasen Betts wrote:
> In plpgsql IF is an implicit select.
> <http://www.postgresql.org/docs/8.4/interactive/plpgsql-expressions.html>
>
>          IF EXISTS (SELECT 1 FROM t1 WHERE
>          volume_id = NEW.volume_id AND name = NEW.name) THEN
>              RAISE EXCEPTION '% already exists on volume', NEW.name;
>          END IF;

Thanks.  Certainly more concise.

>
> > 0.053 ms/record / 1,000 ms/sec * 10,000 records = .53 seconds
>
> huh?
>
> > According to that stat, this lookup should be adding about 0.5 seconds
> > to 10,000 records,
>
> Why?  What are you doing to 10,000 records?


Inserting them.  Sorry, that was a critical omission on my part.  The
trigger check is for inserts and I'm testing its performance by
inserting 10,000 records.

Turns out my EXPLAIN ANALYZE times were so low because I was querying
for a volume_id that had only a few rows.  When I query for a volume_id
with 10,000 rows, it changes to 7 ms, which matches the performance I'm
seeing.  That's acceptable to me because that's probably at the upper
end of what I'll see.  7 ms to check 10,000 text fields is actually
impressive to me.


>
> How much disk (in bytes, and dollars) are you hoping to save by not
> using the index?
>

I need to be able to fit my indexes in RAM.  This table will have a few
billion records, I have several other indexes with billions of records,
and I'd like my DB to run well on a machine with 20G (preferred) to 60G
(max) of RAM without resorting to sharding.  These text fields can be up
to 1k each.  A 1-billion-row int8 index comes in around 2G.  Adding the
text field to the index would probably put it at over 20G per billion
records.
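
(For concreteness: index and table sizes on a test sample can be
compared with pg_relation_size, along the lines of the following; the
index name here is just an example.)

    SELECT pg_size_pretty(pg_relation_size('volume_id_idx')) AS index_size,
           pg_size_pretty(pg_relation_size('t1'))            AS table_size;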

Thanks.

-Randall


Re: Trigger Performance

From: Alban Hertroys
On 16 Jan 2011, at 5:38, Randall Smith wrote:

>> But you already do have an index on that id-field, so what's the problem
>> with using a unique constraint?  Its unique index could just replace the
>> existing one.
>
> I'm OK with indexing the 8 byte integer, but I don't want to index the
> text field that could possibly be up to 1k.  In some simple tests,
> indexing both fields, the index was larger than the table.


Ah, it was about a different field - that makes much more sense :)

You probably don't need the full width of that field to be fairly
certain that it's going to be unique from that point on anyway.  The
chance that a unique constraint kicks in when it shouldn't gets lower
the wider your index on that field gets, but it's probably sufficiently
low with relatively few characters, while it's probably not that big a
deal if it kicks in early in a few cases.

For example, if you only index the first 100 bytes of each of these
fields, you are very unlikely to run into a situation where your
constraint claims the field is not unique while it actually is.
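
Something like this, for example (untested, and the width of 100 is of
course a guess; the index name is made up):

    CREATE UNIQUE INDEX t1_volume_name_prefix_key
        ON t1 (volume_id, substr(name, 1, 100));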

That shrinks your index down by a factor of 10, and IMO 100 bytes is
still on the wide side for a unique text column.  I don't think many
people successfully enter 100 characters in a row without ever making
an error.

If instead the field contains computer-generated data, then there surely
is an easier way to detect that this text is not going to be unique.
You could store the values of the parameters used to generate that data,
for example, and put a unique constraint on those.


I'd also wonder what the value is of two distinct records with the same
1000 characters of text up to the last one.  Are those distinct on
purpose, or is it more likely that someone made an error entering that
field and accidentally managed to create a unique entry while in fact
the unique constraint was supposed to kick in?

Besides that, you say there will be billions of records, but that text
field needs to be unique?  I really do hope the contents of that field
will be computer-generated, or your users will get quite desperate
coming up with new values to use after a while, especially if the
contents need to make any sense.

But of course much depends on the nature of the data in your field. You haven't said anything about that.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.





Re: Trigger Performance

From: Tom Lane
Randall Smith <randall@tnr.cc> writes:
> I've created a trigger that checks the uniqueness of two columns in a
> table.  Traditionally, one would use a unique constraint, but in my
> case, the size of the unique index would be too large and some
> performance loss is acceptable.  However, the trigger performance seems
> to be far below what's indicated by an explain analyze of the query used
> in the trigger.

You realize of course that this is fundamentally broken and cannot be
trusted?  Worrying about the performance seems rather pointless.

The reason it can't be trusted is that two concurrent insertions will
not see each other's rows.
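
Concretely, under the default READ COMMITTED isolation:

    -- session A
    BEGIN;
    INSERT INTO t1 (volume_id, name) VALUES (300, 'whodat');
    -- the trigger's SELECT finds nothing, so the insert proceeds

    -- session B, before A commits
    BEGIN;
    INSERT INTO t1 (volume_id, name) VALUES (300, 'whodat');
    -- A's uncommitted row is invisible to this snapshot, so the
    -- trigger finds nothing here too, and this insert also proceeds

    -- after both sessions COMMIT, the table holds two copies of
    -- (300, 'whodat'), exactly what the trigger was meant to forbid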

            regards, tom lane

Re: Trigger Performance

From: Fredric Fredricson
On 01/15/2011 11:52 PM, Randall Smith wrote:
> Hi,
>
> I've created a trigger that checks the uniqueness of two columns in a
> table.  Traditionally, one would use a unique constraint, but in my
> case, the size of the unique index would be too large and some
> performance loss is acceptable.  However, the trigger performance seems
> to be far below what's indicated by an explain analyze of the query used
> in the trigger.
>
> The unique fields consist of an indexed int8 (volume_id) and a text
> field (name).  The average ratio of volume_id to name is 1 to 10,000.
> The query I'm using to check uniqueness in the trigger is:
A possible workaround could be to store an extra field that contains
the md5 hash of the text field, and use that field in the index instead
of the text itself.  The md5 sum could be calculated by your trigger.
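
Something along these lines, perhaps (a sketch only; the column,
function and index names are made up):

    ALTER TABLE t1 ADD COLUMN name_md5 text;
    CREATE UNIQUE INDEX t1_volume_name_md5_key ON t1 (volume_id, name_md5);

    CREATE OR REPLACE FUNCTION t1_set_name_md5() RETURNS trigger AS $$
    BEGIN
        -- keep the hash column in sync with the name column
        NEW.name_md5 := md5(NEW.name);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER t1_name_md5 BEFORE INSERT OR UPDATE ON t1
        FOR EACH ROW EXECUTE PROCEDURE t1_set_name_md5();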

/Fredric

>          ...
>          IF (SELECT EXISTS (SELECT 1 FROM t1 WHERE
>          volume_id = NEW.volume_id AND name = NEW.name)) THEN
>              RAISE EXCEPTION '% already exists on volume', NEW.name;
>          END IF;
>          ...
>
> This trigger is called only BEFORE INSERTS.
>
> In testing, without the trigger, inserts are at 10,000 every 2 seconds.
> With the trigger, they are 10,000 every 40 seconds.  The output of
> explain analyze suggests that this shouldn't be taking so long.
>
>      EXPLAIN ANALYZE SELECT EXISTS (
>          SELECT 1 FROM t1 WHERE volume_id = 300 AND name = 'whodat');
>
>   Result  (cost=8.35..8.36 rows=1 width=0) (actual time=0.019..0.020
> rows=1 loops=1)
>     InitPlan 1 (returns $0)
>       ->   Index Scan using volume_id_idx on t1  (cost=0.00..8.35 rows=1
> width=0) (actual time=0.016..0.016 rows=0 loops=1)
>             Index Cond: (volume_id = 300)
>             Filter: (name = 'whodat'::text)
>   Total runtime: 0.053 ms
> (6 rows)
>
> 0.053 ms/record / 1,000 ms/sec * 10,000 records = .53 seconds
>
> According to that stat, this lookup should be adding about 0.5 seconds
> to 10,000 records, far from the actual 38 seconds it is adding.  I've
> tried to change up the query in the trigger to see if I could get
> different results with not much luck.  Any idea what might be taking up
> the extra time or what I can do to troubleshoot?
>
> Thanks.
>
> -Randall

Re: Trigger Performance

From: Jasen Betts
On 2011-01-16, Randall Smith <randall@tnr.cc> wrote:
> Before reading: this is solved.  It was an error on my part.
>
> On Sun, 2011-01-16 at 03:46 +0000, Jasen Betts wrote:
>> In plpgsql IF is an implicit select.
>> <http://www.postgresql.org/docs/8.4/interactive/plpgsql-expressions.html>
>>
>>          IF EXISTS (SELECT 1 FROM t1 WHERE
>>          volume_id = NEW.volume_id AND name = NEW.name) THEN
>>              RAISE EXCEPTION '% already exists on volume', NEW.name;
>>          END IF;
>
> Thanks.  Certainly more concise.
>
>>
>> > 0.053 ms/record / 1,000 ms/sec * 10,000 records = .53 seconds
>>
>> huh?
>>
>> > According to that stat, this lookup should be adding about 0.5 seconds
>> > to 10,000 records,
>>
>> Why?  What are you doing to 10,000 records?
>
> Inserting them.  Sorry, that was a critical omission on my part.  The
> trigger check is for inserts and I'm testing its performance by
> inserting 10,000 records.

That'll have n·log(n) complexity (or worse); you can't simply multiply
by 10,000, because each row added to the index slows the average index
lookup time a little.

> Turns out my EXPLAIN ANALYZE times were so low because I was querying
> for a volume_id that had only a few rows.  When I query for a volume_id
> with 10,000 rows, it changes to 7 ms, which matches the performance I'm
> seeing.  That's acceptable to me because that's probably at the upper
> end of what I'll see.  7 ms to check 10,000 text fields is actually
> impressive to me.

>> How much disk (in bytes, and dollars) are you hoping to save by not
>> using the index?

> I need to be able to fit my indexes in RAM.

Why?  Indices that don't fit are still useful.  Doing 20 or so record
retrievals to confirm the absence of a record is likely to be slower
than doing 5 reads and a write or two to check and update a disk-based
index.

Postgres isn't going to blindly load all the indices into RAM.

> I need to be able to fit my indexes in RAM.  This table will have a few
> billion records ...  These text fields can
> be up to 1k each.

This query will create about 300,000 sample records with ~1K of text
each.  Repeat it with different numbers in the first generate_series
(10001,20000; then 20001,30000; and so on) until you get 2 billion
records, then create your indices and do your testing.

insert into t1
  -- ~290,000 rows: a cycles over 1..10000 (the volume_id), b over
  -- 1..29; name is 14 concatenated md5 hashes (~448 characters)
  select a::integer as volume_id,
    md5((a*b)::text) || md5((a*b+1)::text) || md5((a*b+3)::text)
    || md5((a*b+4)::text) || md5((a*b+5)::text) || md5((a*b+6)::text)
    || md5((a*b+7)::text) || md5((a*b+8)::text) || md5((a*b+9)::text)
    || md5((a*b+11)::text) || md5((a*b+12)::text) || md5((a*b+13)::text)
    || md5((a*b+14)::text) || md5((a*b+15)::text) as name
  from (select generate_series( 1 ,10000) as a,
               generate_series(1,29) as b) as foo;

--
⚂⚃ 100% natural

Re: Trigger Performance

From: Jasen Betts
On 2011-01-17, Fredric Fredricson <Fredric.Fredricson@bonetmail.com> wrote:
>
> On 01/15/2011 11:52 PM, Randall Smith wrote:
>> Hi,
>>
>> I've created a trigger that checks the uniqueness of two columns in a
>> table.  Traditionally, one would use a unique constraint, but in my
>> case, the size of the unique index would be too large and some
>> performance loss is acceptable.  However, the trigger performance seems
>> to be far below what's indicated by an explain analyze of the query used
>> in the trigger.
>>
>> The unique fields consist of an indexed int8 (volume_id) and a text
>> field (name).  The average ratio of volume_id to name is 1 to 10,000.
>> The query I'm using to check uniqueness in the trigger is:

> A possible workaround could be to store an extra field that contains
> the md5 hash of the text field and use that field in the index instead
> of the text itself. The md5 sum could be calculated by your trigger.

You don't have to store it, only index on it.

create unique index t1_nodups on t1 (volume_id, md5(name));
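
The unique check then happens in the index itself on every insert.
Note that a lookup on the name will only use an expression index like
this if the query repeats the same expression, e.g.

    SELECT 1 FROM t1
    WHERE volume_id = 300 AND md5(name) = md5('whodat');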

--
⚂⚃ 100% natural