Thread: Trigger Performance
Hi,

I've created a trigger that checks the uniqueness of two columns in a
table.  Traditionally, one would use a unique constraint, but in my
case, the size of the unique index would be too large and some
performance loss is acceptable.  However, the trigger performance seems
to be far below what's indicated by an EXPLAIN ANALYZE of the query used
in the trigger.

The unique fields consist of an indexed int8 (volume_id) and a text
field (name).  The average ratio of volume_id to name is 1 to 10,000.
The query I'm using to check uniqueness in the trigger is:

    ...
    IF (SELECT EXISTS (SELECT 1 FROM t1 WHERE
            volume_id = NEW.volume_id AND name = NEW.name)) THEN
        RAISE EXCEPTION '% already exists on volume', NEW.name;
    END IF;
    ...

This trigger is called only BEFORE INSERT.

In testing, without the trigger, inserts run at 10,000 every 2 seconds.
With the trigger, they run at 10,000 every 40 seconds.  The output of
EXPLAIN ANALYZE suggests that this shouldn't be taking so long.

    EXPLAIN ANALYZE SELECT EXISTS (
        SELECT 1 FROM t1 WHERE volume_id = 300 AND name = 'whodat');

    Result  (cost=8.35..8.36 rows=1 width=0) (actual time=0.019..0.020
    rows=1 loops=1)
      InitPlan 1 (returns $0)
        ->  Index Scan using volume_id_idx on t1  (cost=0.00..8.35 rows=1
            width=0) (actual time=0.016..0.016 rows=0 loops=1)
              Index Cond: (volume_id = 300)
              Filter: (name = 'whodat'::text)
    Total runtime: 0.053 ms
    (6 rows)

0.053 ms/record / 1,000 ms/sec * 10,000 records = 0.53 seconds

According to that figure, this lookup should add about 0.5 seconds to
10,000 records, far from the actual 38 seconds it adds.  I've tried
changing the query in the trigger to see if I could get different
results, without much luck.  Any idea what might be taking up the extra
time, or what I can do to troubleshoot?

Thanks.

-Randall
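A minimal sketch of what the complete trigger might look like (the
function and trigger names here are purely illustrative, and the table is
assumed to be t1(volume_id int8, name text)):

    -- Trigger function: reject an insert if (volume_id, name) already exists.
    CREATE OR REPLACE FUNCTION check_volume_name_unique() RETURNS trigger AS $$
    BEGIN
        IF (SELECT EXISTS (SELECT 1 FROM t1 WHERE
                volume_id = NEW.volume_id AND name = NEW.name)) THEN
            RAISE EXCEPTION '% already exists on volume', NEW.name;
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    -- Fired before each insert only.
    CREATE TRIGGER t1_check_unique
        BEFORE INSERT ON t1
        FOR EACH ROW EXECUTE PROCEDURE check_volume_name_unique();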
On 15 Jan 2011, at 23:52, Randall Smith wrote:

> Hi,
>
> I've created a trigger that checks the uniqueness of two columns in a
> table. Traditionally, one would use a unique constraint, but in my
> case, the size of the unique index would be too large and some
> performance loss is acceptable.

But you already do have an index on that id-field, so what's the problem
with using a unique constraint?  Its unique index could just replace the
existing one.

Alban Hertroys

--
Screwing up is an excellent way to attach something to the ceiling.
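For reference, if a two-column unique constraint were acceptable, it is a
one-line change (the constraint name is illustrative; schema assumed as
above):

    -- Declarative alternative: let PostgreSQL enforce uniqueness itself.
    ALTER TABLE t1
        ADD CONSTRAINT t1_volume_id_name_key UNIQUE (volume_id, name);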
On 2011-01-15, Randall Smith <randall@tnr.cc> wrote:
> Hi,
>
> I've created a trigger that checks the uniqueness of two columns in a
> table. Traditionally, one would use a unique constraint, but in my
> case, the size of the unique index would be too large and some
> performance loss is acceptable. However, the trigger performance seems
> to be far below what's indicated by an explain analyze of the query used
> in the trigger.
>
> The unique fields consist of an indexed int8 (volume_id) and a text
> field (name). The average ratio of volume_id to name is 1 to 10,000.
> The query I'm using to check uniqueness in the trigger is:
>
> ...
> IF (SELECT EXISTS (SELECT 1 FROM t1 WHERE
>         volume_id = NEW.volume_id AND name = NEW.name)) THEN
>     RAISE EXCEPTION '% already exists on volume', NEW.name;
> END IF;
> ...

In plpgsql, IF is an implicit SELECT.
<http://www.postgresql.org/docs/8.4/interactive/plpgsql-expressions.html>

    IF EXISTS (SELECT 1 FROM t1 WHERE
            volume_id = NEW.volume_id AND name = NEW.name) THEN
        RAISE EXCEPTION '% already exists on volume', NEW.name;
    END IF;

I don't know if that will help performance, though.

> Result  (cost=8.35..8.36 rows=1 width=0) (actual time=0.019..0.020
> rows=1 loops=1)
>   InitPlan 1 (returns $0)
>     ->  Index Scan using volume_id_idx on t1  (cost=0.00..8.35 rows=1
>         width=0) (actual time=0.016..0.016 rows=0 loops=1)
>           Index Cond: (volume_id = 300)
>           Filter: (name = 'whodat'::text)
> Total runtime: 0.053 ms
> (6 rows)

I got 0.4 ms the first time I tried this.  That actual runtime is helped
by cache locality; reconnect and try it again and you'll see a worse
figure (I got a factor of three difference).  At 53 us it probably didn't
hit the hard disk; when that starts happening, things will get much worse.

> 0.053 ms/record / 1,000 ms/sec * 10,000 records = .53 seconds

Huh?

> According to that stat, this lookup should be adding about 0.5 seconds
> to 10,000 records,

Why?  What are you doing to 10,000 records?

> far from the actual 38 seconds it is adding. I've
> tried to change up the query in the trigger to see if I could get
> different results with not much luck. Any idea what might be taking up
> the extra time or what I can do to troubleshoot?

Try it without that check (on sample data) and see how much faster it
goes (just comment out that part of the trigger).  When I tested that
here I could not see a definite difference (timings were all over the
place, some slower, some faster).

How much disk (in bytes, and dollars) are you hoping to save by not
using the index?

--
⚂⚃ 100% natural
Before reading: this is solved.  It was an error on my part.

On Sun, 2011-01-16 at 03:46 +0000, Jasen Betts wrote:
> In plpgsql, IF is an implicit SELECT.
> <http://www.postgresql.org/docs/8.4/interactive/plpgsql-expressions.html>
>
>     IF EXISTS (SELECT 1 FROM t1 WHERE
>             volume_id = NEW.volume_id AND name = NEW.name) THEN
>         RAISE EXCEPTION '% already exists on volume', NEW.name;
>     END IF;

Thanks.  Certainly more concise.

>
> > 0.053 ms/record / 1,000 ms/sec * 10,000 records = .53 seconds
>
> huh?
>
> > According to that stat, this lookup should be adding about 0.5 seconds
> > to 10,000 records,
>
> why? what are you doing to 10000 records.

Inserting them.  Sorry, that was a critical omission on my part.  The
trigger check is for inserts and I'm testing its performance by
inserting 10,000 records.

Turns out my EXPLAIN ANALYZE times were so low because I was querying
for a volume_id that had only a few rows.  When I query for a volume_id
with 10,000 rows, it changes to 7 ms, which matches the performance I'm
seeing.  That's acceptable to me because that's probably at the upper
end of what I'll see.  7 ms to check 10,000 text fields is actually
impressive to me.

> how much disk (in bytes, and dollars) are you hoping to save by not
> using the index.

I need to be able to fit my indexes in RAM.  This table will have a few
billion records, and I have several other indexes with billions of
records.  I'd like my DB to run well on a machine with 20G (preferred)
to 60G (max) of RAM and not have to resort to sharding.  These text
fields can be up to 1k each.  A 1 billion row int8 index comes in at
around 2G.  Adding the text field to the index would probably put it at
over 20G per billion records.

Thanks.

-Randall
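To sanity-check size estimates like these, index sizes can be measured
directly; the index and table names below are just examples:

    -- Measure how big one existing index actually is on disk.
    SELECT pg_size_pretty(pg_relation_size('volume_id_idx'));

    -- Or list every index on the table with its size.
    SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid))
    FROM pg_stat_user_indexes
    WHERE relname = 't1';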
On 16 Jan 2011, at 5:38, Randall Smith wrote:

>> But you already do have an index on that id-field, so what's the
>> problem with using a unique constraint? Its unique index could just
>> replace the existing one.
>
> I'm OK with indexing the 8 byte integer, but I don't want to index the
> text field that could possibly be up to 1k. In some simple tests,
> indexing both fields, the index was larger than the table.

Ah, it was about a different field - that makes much more sense :)

You probably don't need the full width of that field to be fairly
certain that it's going to be unique from that point on anyway.  The
chance that a unique constraint kicks in when it shouldn't gets lower
the wider your index on that field gets, but it's probably sufficiently
low with relatively few characters, while it's probably not that big a
deal if it kicks in early in a few cases.

For example, if you only index the first 100 bytes of each of these
fields, you are very unlikely to run into a situation where your
constraint claims the field is not unique while it actually is.

That shrinks your index down by a factor of 10, and IMO 100 bytes is
still on the wide side for a unique text column.  I don't think many
people successfully enter 100 characters in a row without ever making
an error.

If instead the field contains computer-generated data, then there surely
is an easier way to detect that this text is not going to be unique.  You
could store the values of the parameters used to generate that data, for
example, and put a unique constraint on those.

I'd also wonder what the value is of two distinct records with the same
1000 characters of text up to the last one.  Are those distinct on
purpose, or is it more likely that someone made an error entering that
field and accidentally managed to create a unique entry while in fact
the unique constraint was supposed to kick in?

Besides that, you say there will be billions of records, but that text
field needs to be unique?  I really do hope the contents of that field
will be computer-generated, or your users will get quite desperate
coming up with new values to use after a while, especially if the
contents need to make any sense.

But of course much depends on the nature of the data in your field.  You
haven't said anything about that.

Alban Hertroys

--
If you can't see the forest for the trees, cut the trees and you'll see
there is no forest.
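A sketch of that idea as a prefix-only unique index (the index name is
made up, and substr() here limits to the first 100 characters rather than
bytes):

    -- Enforce uniqueness on volume_id plus only the first 100 characters
    -- of name.  This can reject a genuinely unique name whose first 100
    -- characters collide with an existing row, which is the trade-off
    -- described above.
    CREATE UNIQUE INDEX t1_volume_name_prefix_key
        ON t1 (volume_id, substr(name, 1, 100));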
Randall Smith <randall@tnr.cc> writes:
> I've created a trigger that checks the uniqueness of two columns in a
> table. Traditionally, one would use a unique constraint, but in my
> case, the size of the unique index would be too large and some
> performance loss is acceptable. However, the trigger performance seems
> to be far below what's indicated by an explain analyze of the query used
> in the trigger.

You realize of course that this is fundamentally broken and cannot be
trusted?  Worrying about the performance seems rather pointless.

The reason it can't be trusted is that two concurrent insertions of the
same key will not see each other, so both pass the check.

			regards, tom lane
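One way a trigger-based check could be made safe against this race (not
suggested in the thread; just a sketch) is to serialize inserts per
volume_id with an advisory lock.  Note that pg_advisory_xact_lock()
requires PostgreSQL 9.1 or later; on older releases pg_advisory_lock()
and pg_advisory_unlock() would have to be managed by hand.

    -- Sketch only: serialize concurrent inserts for the same volume_id so
    -- the existence check and the insert can't interleave.
    CREATE OR REPLACE FUNCTION check_volume_name_unique() RETURNS trigger AS $$
    BEGIN
        -- Lock is held until the transaction commits or rolls back.
        PERFORM pg_advisory_xact_lock(NEW.volume_id);
        IF EXISTS (SELECT 1 FROM t1 WHERE
                volume_id = NEW.volume_id AND name = NEW.name) THEN
            RAISE EXCEPTION '% already exists on volume', NEW.name;
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;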
On 01/15/2011 11:52 PM, Randall Smith wrote:
> Hi,
>
> I've created a trigger that checks the uniqueness of two columns in a
> table. Traditionally, one would use a unique constraint, but in my
> case, the size of the unique index would be too large and some
> performance loss is acceptable. However, the trigger performance seems
> to be far below what's indicated by an explain analyze of the query used
> in the trigger.
>
> The unique fields consist of an indexed int8 (volume_id) and a text
> field (name). The average ratio of volume_id to name is 1 to 10,000.
> The query I'm using to check uniqueness in the trigger is:

A possible workaround could be to store an extra field that contains
the md5 hash of the text field and use that field in the index instead
of the text itself.  The md5 sum could be calculated by your trigger.

/Fredric

> ...
> IF (SELECT EXISTS (SELECT 1 FROM t1 WHERE
>         volume_id = NEW.volume_id AND name = NEW.name)) THEN
>     RAISE EXCEPTION '% already exists on volume', NEW.name;
> END IF;
> ...
>
> This trigger is called only BEFORE INSERTS.
>
> In testing, without the trigger, inserts are at 10,000 every 2 seconds.
> With the trigger, they are 10,000 every 40 seconds. The output of
> explain analyze suggests that this shouldn't be taking so long.
>
> EXPLAIN ANALYZE SELECT EXISTS (
>     SELECT 1 FROM t1 WHERE volume_id = 300 AND name = 'whodat');
>
> Result  (cost=8.35..8.36 rows=1 width=0) (actual time=0.019..0.020
> rows=1 loops=1)
>   InitPlan 1 (returns $0)
>     ->  Index Scan using volume_id_idx on t1  (cost=0.00..8.35 rows=1
>         width=0) (actual time=0.016..0.016 rows=0 loops=1)
>           Index Cond: (volume_id = 300)
>           Filter: (name = 'whodat'::text)
> Total runtime: 0.053 ms
> (6 rows)
>
> 0.053 ms/record / 1,000 ms/sec * 10,000 records = .53 seconds
>
> According to that stat, this lookup should be adding about 0.5 seconds
> to 10,000 records, far from the actual 38 seconds it is adding. I've
> tried to change up the query in the trigger to see if I could get
> different results with not much luck. Any idea what might be taking up
> the extra time or what I can do to troubleshoot?
>
> Thanks.
>
> -Randall
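A sketch of that approach with a stored hash column (the column,
function, and index names are made up; Jasen's reply further down shows a
simpler variant that skips the extra column entirely):

    -- Extra column holding md5(name), kept in sync by a trigger.
    ALTER TABLE t1 ADD COLUMN name_md5 text;

    -- Uniqueness is enforced on the fixed-width hash, not the wide text.
    CREATE UNIQUE INDEX t1_volume_name_md5_key ON t1 (volume_id, name_md5);

    CREATE OR REPLACE FUNCTION t1_fill_name_md5() RETURNS trigger AS $$
    BEGIN
        -- 32-character hex digest, regardless of how long name is.
        NEW.name_md5 := md5(NEW.name);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER t1_fill_name_md5
        BEFORE INSERT OR UPDATE ON t1
        FOR EACH ROW EXECUTE PROCEDURE t1_fill_name_md5();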
On 2011-01-16, Randall Smith <randall@tnr.cc> wrote:
> Before reading: this is solved.  It was an error on my part.
>
> On Sun, 2011-01-16 at 03:46 +0000, Jasen Betts wrote:
>> In plpgsql, IF is an implicit SELECT.
>> <http://www.postgresql.org/docs/8.4/interactive/plpgsql-expressions.html>
>>
>>     IF EXISTS (SELECT 1 FROM t1 WHERE
>>             volume_id = NEW.volume_id AND name = NEW.name) THEN
>>         RAISE EXCEPTION '% already exists on volume', NEW.name;
>>     END IF;
>
> Thanks. Certainly more concise.
>
>> > 0.053 ms/record / 1,000 ms/sec * 10,000 records = .53 seconds
>>
>> huh?
>>
>> > According to that stat, this lookup should be adding about 0.5 seconds
>> > to 10,000 records,
>>
>> why? what are you doing to 10000 records.
>
> Inserting them. Sorry, that was a critical omission on my part. The
> trigger check is for inserts and I'm testing its performance by
> inserting 10,000 records.

That'll have n*log(n) complexity (or worse): you can't simply multiply
by 10,000, because each row added to the index slows the average index
lookup time a little.

> Turns out my EXPLAIN ANALYZE times were so low because I was querying
> for a volume_id that had only a few rows. When I query for a volume_id
> with 10,000 rows, it changes to 7 ms, which matches the performance I'm
> seeing. That's acceptable to me because that's probably at the upper
> end of what I'll see. 7 ms to check 10,000 text fields is actually
> impressive to me.

>> how much disk (in bytes, and dollars) are you hoping to save by not
>> using the index.

> I need to be able to fit my indexes in RAM.

Why?  Indices that don't fit are still useful.  Doing 20 or so record
retrievals to confirm the absence of a record is likely to be slower
than doing 5 reads and a write or two to check and update a disk-based
index.  Postgres isn't going to blindly load all the indices into RAM.

> I need to be able to fit my indexes in RAM. This table will have a few
> billion records ... These text fields can be up to 1k each.

This query will create about 300,000 sample records with 1K of text.
Repeat it with different numbers in the first generate_series
(10001,20000, then 20001,30000, etc.) until you get 2 billion records,
then create your indices and do your testing.

    insert into t1
    select a::integer as volume_id,
           md5((a*b)::text)    || md5((a*b+1)::text)  || md5((a*b+3)::text)  ||
           md5((a*b+4)::text)  || md5((a*b+5)::text)  || md5((a*b+6)::text)  ||
           md5((a*b+7)::text)  || md5((a*b+8)::text)  || md5((a*b+9)::text)  ||
           md5((a*b+11)::text) || md5((a*b+12)::text) || md5((a*b+13)::text) ||
           md5((a*b+14)::text) || md5((a*b+15)::text) as name
    from (select generate_series(1, 10000) as a,
                 generate_series(1, 29) as b) as foo;

--
⚂⚃ 100% natural
On 2011-01-17, Fredric Fredricson <Fredric.Fredricson@bonetmail.com> wrote:
> On 01/15/2011 11:52 PM, Randall Smith wrote:
>> Hi,
>>
>> I've created a trigger that checks the uniqueness of two columns in a
>> table. Traditionally, one would use a unique constraint, but in my
>> case, the size of the unique index would be too large and some
>> performance loss is acceptable. However, the trigger performance seems
>> to be far below what's indicated by an explain analyze of the query used
>> in the trigger.
>>
>> The unique fields consist of an indexed int8 (volume_id) and a text
>> field (name). The average ratio of volume_id to name is 1 to 10,000.
>> The query I'm using to check uniqueness in the trigger is:
>
> A possible workaround could be to store an extra field that contains
> the md5 hash of the text field and use that field in the index instead
> of the text itself. The md5 sum could be calculated by your trigger.

You don't have to store it, only index on it:

    create unique index t1_nodups on t1 (volume_id, md5(name));

--
⚂⚃ 100% natural
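As a follow-up to Tom Lane's concurrency point: unlike the trigger, a
unique index like this is enforced atomically, so concurrent inserts
can't slip past it.  A hypothetical session, with the error message
shown only approximately:

    insert into t1 (volume_id, name) values (300, 'whodat');  -- ok
    insert into t1 (volume_id, name) values (300, 'whodat');
    -- ERROR:  duplicate key value violates unique constraint "t1_nodups"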