Thread: Regex performance issue
Hello.. I have a low performance problem with regexp. Here are the details: asterisk=> explain analyze SELECT * FROM destlist WHERE '0039051248787' ~ prefix AND id_ent='-2' AND dir=0 ORDER by length(prefix) DESC; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------ Sort (cost=7925.07..7925.15 rows=31 width=67) (actual time=857.715..857.716 rows=2 loops=1) Sort Key: length((prefix)::text) -> Bitmap Heap Scan on destlist (cost=60.16..7924.30 rows=31 width=67) (actual time=2.156..857.686 rows=2 loops=1) Recheck Cond: ((id_ent = -2) AND (dir = 0)) Filter: ('0039051248787'::text ~ (prefix)::text) -> Bitmap Index Scan on destlist_indx2 (cost=0.00..60.16 rows=6193 width=0) (actual time=1.961..1.961 rows=5205 loops=1) Index Cond: ((id_ent = -2) AND (dir = 0)) Total runtime: 857.804 ms (8 rows) The main problem is the query time. As you can see , the use of index destlist_indx2 is pretty quick (1.9 ms) , but the regexp operation takes a lot of time (857 ms). How can i improve that ? Regards Alex PS: Additional info: [root@voce1 billing]# uname -a Linux voce1 2.6.11-1.1369_FC4smp #1 SMP Thu Jun 2 23:16:33 EDT 2005 x86_64 x86_64 x86_64 GNU/Linux [root@voce1 billing]# free total used free shared buffers cached Mem: 1023912 977788 46124 0 80900 523868 -/+ buffers/cache: 373020 650892 Swap: 3172728 8488 3164240 Welcome to psql 8.1.2, the PostgreSQL interactive terminal. asterisk=> \d destlist; Table "public.destlist" Column | Type | Modifiers ---------+------------------------+----------------------------------------------------------- id | bigint | not null default nextval(('destlist_id'::text)::regclass) id_ent | integer | dir | integer | prefix | character varying(255) | country | character varying(255) | network | character varying(255) | tip | integer | Indexes: "destlist_unique" UNIQUE, btree (id_ent, dir, prefix) "destlist_indx2" btree (id_ent, dir) "destlist_indx3" btree (id_ent, dir, prefix) "mmumu" btree (prefix varchar_pattern_ops) asterisk=> select count(*) from destlist; count -------- 576424 (1 row) [root@voce1 billing]# cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 4 model name : Intel(R) Xeon(TM) CPU 3.00GHz stepping : 3 cpu MHz : 2992.658 cache size : 2048 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 1 fpu : yes fpu_exception : yes cpuid level : 5 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl cid cx16 xtpr bogomips : 5914.62 clflush size : 64 cache_alignment : 128 address sizes : 36 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 15 model : 4 model name : Intel(R) Xeon(TM) CPU 3.00GHz stepping : 3 cpu MHz : 2992.658 cache size : 2048 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 1 fpu : yes fpu_exception : yes cpuid level : 5 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl cid cx16 xtpr bogomips : 5980.16 clflush size : 64 cache_alignment : 128 address sizes : 36 bits physical, 48 bits virtual power management:
> -----Original Message----- > From: pgsql-performance-owner@postgresql.org On Behalf Of Alexandru Coseru > asterisk=> explain analyze SELECT * FROM destlist WHERE > '0039051248787' ~ > prefix AND id_ent='-2' AND dir=0 ORDER by length(prefix) DESC; > > > QUERY PLAN > -------------------------------------------------------------- > ---------------------------------------------------------------------- > Sort (cost=7925.07..7925.15 rows=31 width=67) (actual > time=857.715..857.716 rows=2 loops=1) > Sort Key: length((prefix)::text) > -> Bitmap Heap Scan on destlist (cost=60.16..7924.30 > rows=31 width=67) > (actual time=2.156..857.686 rows=2 loops=1) > Recheck Cond: ((id_ent = -2) AND (dir = 0)) > Filter: ('0039051248787'::text ~ (prefix)::text) > -> Bitmap Index Scan on destlist_indx2 (cost=0.00..60.16 > rows=6193 width=0) (actual time=1.961..1.961 rows=5205 loops=1) > Index Cond: ((id_ent = -2) AND (dir = 0)) > Total runtime: 857.804 ms > (8 rows) > > > "mmumu" btree (prefix varchar_pattern_ops) > I'm surpised Postgres isn't using the index on prefix seeing as the index uses the varchar_pattern_ops operator class. It could be that the index isn't selective enough, or is Postgres not able to use an index with Posix regular expressions? The docs seem to say that it can, but I'd be curious to see what happens if you use LIKE instead of ~. Dave
Hello... I cannot use LIKE , because the order of the match is reversed. The prefix column is containing telephone destinations. IE: ^001 - US , ^0039 Italy , etc.. Here is a small sample: asterisk=> select * from destlist LIMIT 10; id | id_ent | dir | prefix | country | network | tip ----+--------+-----+------------+-------------+--------------------+----- 1 | -1 | 0 | (^0093) | AFGHANISTAN | AFGHANISTAN | 6 2 | -1 | 0 | (^00937) | AFGHANISTAN | AFGHANISTAN Mobile | 5 3 | -1 | 0 | (^00355) | ALBANIA | ALBANIA | 6 4 | -1 | 0 | (^0035538) | ALBANIA | ALBANIA Mobile | 5 5 | -1 | 0 | (^0035568) | ALBANIA | ALBANIA Mobile | 5 6 | -1 | 0 | (^0035569) | ALBANIA | ALBANIA Mobile | 5 7 | -1 | 0 | (^00213) | ALGERIA | ALGERIA | 6 8 | -1 | 0 | (^0021361) | ALGERIA | ALGERIA Mobile | 5 9 | -1 | 0 | (^0021362) | ALGERIA | ALGERIA Mobile | 5 10 | -1 | 0 | (^0021363) | ALGERIA | ALGERIA Mobile | 5 Now , I have to match a dialednumber (let's say 00213618833) and find it's destination...(It's algeria mobile). I tried to make with a query of using LIKE , but i was not able to get something.. Regards Alex ----- Original Message ----- From: "Dave Dutcher" <dave@tridecap.com> To: "'Alexandru Coseru'" <alexandru.coseru@totaltelecom.ro>; <pgsql-performance@postgresql.org> Sent: Saturday, December 02, 2006 10:36 PM Subject: RE: [PERFORM] Regex performance issue > -----Original Message----- > From: pgsql-performance-owner@postgresql.org On Behalf Of Alexandru Coseru > asterisk=> explain analyze SELECT * FROM destlist WHERE > '0039051248787' ~ > prefix AND id_ent='-2' AND dir=0 ORDER by length(prefix) DESC; > > > QUERY PLAN > -------------------------------------------------------------- > ---------------------------------------------------------------------- > Sort (cost=7925.07..7925.15 rows=31 width=67) (actual > time=857.715..857.716 rows=2 loops=1) > Sort Key: length((prefix)::text) > -> Bitmap Heap Scan on destlist (cost=60.16..7924.30 > rows=31 width=67) > (actual time=2.156..857.686 rows=2 loops=1) > Recheck Cond: ((id_ent = -2) AND (dir = 0)) > Filter: ('0039051248787'::text ~ (prefix)::text) > -> Bitmap Index Scan on destlist_indx2 (cost=0.00..60.16 > rows=6193 width=0) (actual time=1.961..1.961 rows=5205 loops=1) > Index Cond: ((id_ent = -2) AND (dir = 0)) > Total runtime: 857.804 ms > (8 rows) > > > "mmumu" btree (prefix varchar_pattern_ops) > I'm surpised Postgres isn't using the index on prefix seeing as the index uses the varchar_pattern_ops operator class. It could be that the index isn't selective enough, or is Postgres not able to use an index with Posix regular expressions? The docs seem to say that it can, but I'd be curious to see what happens if you use LIKE instead of ~. Dave -- No virus found in this incoming message. Checked by AVG Free Edition. Version: 7.1.409 / Virus Database: 268.15.4/563 - Release Date: 12/2/2006
I may miss something but I'd use tsearch2. Check intdict dictionary for basic idea - http://www.sai.msu.su/~megera/wiki/Gendict Oleg On Sat, 2 Dec 2006, Alexandru Coseru wrote: > Hello... > > I cannot use LIKE , because the order of the match is reversed. > The prefix column is containing telephone destinations. > IE: ^001 - US , ^0039 Italy , etc.. > > Here is a small sample: > > asterisk=> select * from destlist LIMIT 10; > id | id_ent | dir | prefix | country | network | tip > ----+--------+-----+------------+-------------+--------------------+----- > 1 | -1 | 0 | (^0093) | AFGHANISTAN | AFGHANISTAN | 6 > 2 | -1 | 0 | (^00937) | AFGHANISTAN | AFGHANISTAN Mobile | 5 > 3 | -1 | 0 | (^00355) | ALBANIA | ALBANIA | 6 > 4 | -1 | 0 | (^0035538) | ALBANIA | ALBANIA Mobile | 5 > 5 | -1 | 0 | (^0035568) | ALBANIA | ALBANIA Mobile | 5 > 6 | -1 | 0 | (^0035569) | ALBANIA | ALBANIA Mobile | 5 > 7 | -1 | 0 | (^00213) | ALGERIA | ALGERIA | 6 > 8 | -1 | 0 | (^0021361) | ALGERIA | ALGERIA Mobile | 5 > 9 | -1 | 0 | (^0021362) | ALGERIA | ALGERIA Mobile | 5 > 10 | -1 | 0 | (^0021363) | ALGERIA | ALGERIA Mobile | 5 > > > Now , I have to match a dialednumber (let's say 00213618833) and find > it's destination...(It's algeria mobile). > I tried to make with a query of using LIKE , but i was not able to get > something.. > > > Regards > Alex > > > > > > ----- Original Message ----- From: "Dave Dutcher" <dave@tridecap.com> > To: "'Alexandru Coseru'" <alexandru.coseru@totaltelecom.ro>; > <pgsql-performance@postgresql.org> > Sent: Saturday, December 02, 2006 10:36 PM > Subject: RE: [PERFORM] Regex performance issue > > >> -----Original Message----- >> From: pgsql-performance-owner@postgresql.org On Behalf Of Alexandru Coseru >> asterisk=> explain analyze SELECT * FROM destlist WHERE >> '0039051248787' ~ >> prefix AND id_ent='-2' AND dir=0 ORDER by length(prefix) DESC; >> >> >> QUERY PLAN >> -------------------------------------------------------------- >> ---------------------------------------------------------------------- >> Sort (cost=7925.07..7925.15 rows=31 width=67) (actual >> time=857.715..857.716 rows=2 loops=1) >> Sort Key: length((prefix)::text) >> -> Bitmap Heap Scan on destlist (cost=60.16..7924.30 >> rows=31 width=67) >> (actual time=2.156..857.686 rows=2 loops=1) >> Recheck Cond: ((id_ent = -2) AND (dir = 0)) >> Filter: ('0039051248787'::text ~ (prefix)::text) >> -> Bitmap Index Scan on destlist_indx2 (cost=0.00..60.16 >> rows=6193 width=0) (actual time=1.961..1.961 rows=5205 loops=1) >> Index Cond: ((id_ent = -2) AND (dir = 0)) >> Total runtime: 857.804 ms >> (8 rows) >> >> >> "mmumu" btree (prefix varchar_pattern_ops) >> > > I'm surpised Postgres isn't using the index on prefix seeing as the index > uses the varchar_pattern_ops operator class. It could be that the index > isn't selective enough, or is Postgres not able to use an index with Posix > regular expressions? The docs seem to say that it can, but I'd be curious > to see what happens if you use LIKE instead of ~. > > Dave > > > > > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Alexandru Coseru wrote: > I cannot use LIKE , because the order of the match is reversed. > The prefix column is containing telephone destinations. > IE: ^001 - US , ^0039 Italy , etc.. Maybe you could create a functional index on substr(<minimum length of prefix>)? It might restrict the result set prior to applying the regex just enough to make the performance acceptable. > asterisk=> select * from destlist LIMIT 10; > id | id_ent | dir | prefix | country | network | tip > ----+--------+-----+------------+-------------+--------------------+----- > 1 | -1 | 0 | (^0093) | AFGHANISTAN | AFGHANISTAN | 6 > 2 | -1 | 0 | (^00937) | AFGHANISTAN | AFGHANISTAN Mobile | 5 > 3 | -1 | 0 | (^00355) | ALBANIA | ALBANIA | 6 > 4 | -1 | 0 | (^0035538) | ALBANIA | ALBANIA Mobile | 5 > 5 | -1 | 0 | (^0035568) | ALBANIA | ALBANIA Mobile | 5 > 6 | -1 | 0 | (^0035569) | ALBANIA | ALBANIA Mobile | 5 > 7 | -1 | 0 | (^00213) | ALGERIA | ALGERIA | 6 > 8 | -1 | 0 | (^0021361) | ALGERIA | ALGERIA Mobile | 5 > 9 | -1 | 0 | (^0021362) | ALGERIA | ALGERIA Mobile | 5 > 10 | -1 | 0 | (^0021363) | ALGERIA | ALGERIA Mobile | 5 > > Now , I have to match a dialednumber (let's say 00213618833) and find it's destination...(It's algeria mobile). > I tried to make with a query of using LIKE , but i was not able to get something.. Another idea would be to add some extra rows so that you could use normal inequality searches. For example, let's take the Albanian rows: 3 | -1 | 0 | 00355 4 | -1 | 0 | 0035538 * 3 | -1 | 0 | 0035539 5 | -1 | 0 | 0035568 6 | -1 | 0 | 0035569 * 3 | -1 | 0 | 0035570 Now you can do "SELECT * FROM destlist WHERE ? >= prefix ORDER BY prefix LIMIT 1". -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Hello.. I cannot use the first advice , because i'm not aware of the prefix length in the database... This is why i'm ordering after length(prefix).. On the 2nd one , i'm not sure that i can follow you.. Regards Alex ----- Original Message ----- From: "Heikki Linnakangas" <heikki@enterprisedb.com> To: "Alexandru Coseru" <alexandru.coseru@totaltelecom.ro> Cc: "Dave Dutcher" <dave@tridecap.com>; <pgsql-performance@postgresql.org> Sent: Sunday, December 03, 2006 12:04 AM Subject: Re: [PERFORM] Regex performance issue > Alexandru Coseru wrote: >> I cannot use LIKE , because the order of the match is reversed. >> The prefix column is containing telephone destinations. >> IE: ^001 - US , ^0039 Italy , etc.. > > Maybe you could create a functional index on substr(<minimum length of > prefix>)? It might restrict the result set prior to applying the regex > just enough to make the performance acceptable. > >> asterisk=> select * from destlist LIMIT 10; >> id | id_ent | dir | prefix | country | network | tip >> ----+--------+-----+------------+-------------+--------------------+----- >> 1 | -1 | 0 | (^0093) | AFGHANISTAN | AFGHANISTAN | 6 >> 2 | -1 | 0 | (^00937) | AFGHANISTAN | AFGHANISTAN Mobile | 5 >> 3 | -1 | 0 | (^00355) | ALBANIA | ALBANIA | 6 >> 4 | -1 | 0 | (^0035538) | ALBANIA | ALBANIA Mobile | 5 >> 5 | -1 | 0 | (^0035568) | ALBANIA | ALBANIA Mobile | 5 >> 6 | -1 | 0 | (^0035569) | ALBANIA | ALBANIA Mobile | 5 >> 7 | -1 | 0 | (^00213) | ALGERIA | ALGERIA | 6 >> 8 | -1 | 0 | (^0021361) | ALGERIA | ALGERIA Mobile | 5 >> 9 | -1 | 0 | (^0021362) | ALGERIA | ALGERIA Mobile | 5 >> 10 | -1 | 0 | (^0021363) | ALGERIA | ALGERIA Mobile | 5 >> >> Now , I have to match a dialednumber (let's say 00213618833) and find >> it's destination...(It's algeria mobile). >> I tried to make with a query of using LIKE , but i was not able to get >> something.. > > Another idea would be to add some extra rows so that you could use normal > inequality searches. For example, let's take the Albanian rows: > > 3 | -1 | 0 | 00355 > 4 | -1 | 0 | 0035538 > * 3 | -1 | 0 | 0035539 > 5 | -1 | 0 | 0035568 > 6 | -1 | 0 | 0035569 > * 3 | -1 | 0 | 0035570 > > Now you can do "SELECT * FROM destlist WHERE ? >= prefix ORDER BY prefix > LIMIT 1". > > -- > Heikki Linnakangas > EnterpriseDB http://www.enterprisedb.com > > > > -- > No virus found in this incoming message. > Checked by AVG Free Edition. > Version: 7.1.409 / Virus Database: 268.15.4/563 - Release Date: 12/2/2006 > >
Alexandru Coseru wrote: > Hello.. > > I cannot use the first advice , because i'm not aware of the prefix > length in the database... > This is why i'm ordering after length(prefix).. > > > On the 2nd one , i'm not sure that i can follow you.. Ok, let me try again :) > asterisk=> select * from destlist LIMIT 10; > id | id_ent | dir | prefix | country | network | tip > ----+--------+-----+------------+-------------+--------------------+----- > 1 | -1 | 0 | (^0093) | AFGHANISTAN | AFGHANISTAN | 6 > 2 | -1 | 0 | (^00937) | AFGHANISTAN | AFGHANISTAN Mobile | 5 > 3 | -1 | 0 | (^00355) | ALBANIA | ALBANIA | 6 > 4 | -1 | 0 | (^0035538) | ALBANIA | ALBANIA Mobile | 5 > 5 | -1 | 0 | (^0035568) | ALBANIA | ALBANIA Mobile | 5 > 6 | -1 | 0 | (^0035569) | ALBANIA | ALBANIA Mobile | 5 > 7 | -1 | 0 | (^00213) | ALGERIA | ALGERIA | 6 > 8 | -1 | 0 | (^0021361) | ALGERIA | ALGERIA Mobile | 5 > 9 | -1 | 0 | (^0021362) | ALGERIA | ALGERIA Mobile | 5 > 10 | -1 | 0 | (^0021363) | ALGERIA | ALGERIA Mobile | 5 Store the prefix in a character column, without the regex stuff. Like below. I've removed the columns that are not relevant, in fact it would make sense to store them in another table, and have just the id and prefix in this table. id | prefix | network ---+---------+-------------------- 1 | 0093 | AFGHANISTAN 2 | 00937 | AFGHANISTAN Mobile 3 | 00355 | ALBANIA 4 | 0035538 | ALBANIA Mobile 5 | 0035568 | ALBANIA Mobile 6 | 0035569 | ALBANIA Mobile 7 | 00213 | ALGERIA 8 | 0021361 | ALGERIA Mobile 9 | 0021362 | ALGERIA Mobile 10 | 0021363 | ALGERIA Mobile Now, add the rows marked with start below: id | prefix | network ----+---------+-------------------- 1 | 0093 | AFGHANISTAN 2 | 00937 | AFGHANISTAN Mobile * 1 | 00938 | AFGHANISTAN 3 | 00355 | ALBANIA 4 | 0035538 | ALBANIA Mobile * 3 | 0035539 | ALBANIA 5 | 0035568 | ALBANIA Mobile 6 | 0035569 | ALBANIA Mobile * 3 | 003557 | ALBANIA 7 | 00213 | ALGERIA 8 | 0021361 | ALGERIA Mobile 9 | 0021362 | ALGERIA Mobile 10 | 0021363 | ALGERIA Mobile * 7 | 0021364 | ALGERIA The added rows make it unnecessary to use regex for the searches. You can use just the >= operator like this: (using the example number you gave) SELECT id FROM destlist WHERE '00213618833' >= prefix ORDER BY prefix LIMIT 1 Which would return id 7. As another example, a query of "00213654321" would match the last row, which again has id 7. I'm too tired to figure out the exact algorithm for adding the rows, but I'm pretty sure it can be automated... The basic idea is that when there's a row with id A and prefix XXXX, and another row with id B and prefix XXXXY, we add another row with id A and prefix XXXX(Y+1). -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Hello.. Thanks for the tip , i think i have got the ideea.. I'm too tired too , and i will try it tommorow. Anyway , anybody has a clue why this regex is that CPU intensive ? I did not saw the light on my drives blinking , and also vmstat doesn't yeld any blocks in or out... And how can it be optimized ? Is there a way to trace the system calls ? strace doesn't give me anything else but some lseeks and reads... PS: Tried it with a 8.2 snaphsot and the result is the same.. Regards Alex ----- Original Message ----- From: "Heikki Linnakangas" <heikki@enterprisedb.com> To: "Alexandru Coseru" <alexandru.coseru@totaltelecom.ro> Cc: <pgsql-performance@postgresql.org> Sent: Sunday, December 03, 2006 12:35 AM Subject: Re: [PERFORM] Regex performance issue > Alexandru Coseru wrote: >> Hello.. >> >> I cannot use the first advice , because i'm not aware of the prefix >> length in the database... >> This is why i'm ordering after length(prefix).. >> >> >> On the 2nd one , i'm not sure that i can follow you.. > > Ok, let me try again :) > >> asterisk=> select * from destlist LIMIT 10; >> id | id_ent | dir | prefix | country | network | tip >> ----+--------+-----+------------+-------------+--------------------+----- >> 1 | -1 | 0 | (^0093) | AFGHANISTAN | AFGHANISTAN | 6 >> 2 | -1 | 0 | (^00937) | AFGHANISTAN | AFGHANISTAN Mobile | 5 >> 3 | -1 | 0 | (^00355) | ALBANIA | ALBANIA | 6 >> 4 | -1 | 0 | (^0035538) | ALBANIA | ALBANIA Mobile | 5 >> 5 | -1 | 0 | (^0035568) | ALBANIA | ALBANIA Mobile | 5 >> 6 | -1 | 0 | (^0035569) | ALBANIA | ALBANIA Mobile | 5 >> 7 | -1 | 0 | (^00213) | ALGERIA | ALGERIA | 6 >> 8 | -1 | 0 | (^0021361) | ALGERIA | ALGERIA Mobile | 5 >> 9 | -1 | 0 | (^0021362) | ALGERIA | ALGERIA Mobile | 5 >> 10 | -1 | 0 | (^0021363) | ALGERIA | ALGERIA Mobile | 5 > > Store the prefix in a character column, without the regex stuff. Like > below. I've removed the columns that are not relevant, in fact it would > make sense to store them in another table, and have just the id and prefix > in this table. > > id | prefix | network > ---+---------+-------------------- > 1 | 0093 | AFGHANISTAN > 2 | 00937 | AFGHANISTAN Mobile > 3 | 00355 | ALBANIA > 4 | 0035538 | ALBANIA Mobile > 5 | 0035568 | ALBANIA Mobile > 6 | 0035569 | ALBANIA Mobile > 7 | 00213 | ALGERIA > 8 | 0021361 | ALGERIA Mobile > 9 | 0021362 | ALGERIA Mobile > 10 | 0021363 | ALGERIA Mobile > > Now, add the rows marked with start below: > > id | prefix | network > ----+---------+-------------------- > 1 | 0093 | AFGHANISTAN > 2 | 00937 | AFGHANISTAN Mobile > * 1 | 00938 | AFGHANISTAN > 3 | 00355 | ALBANIA > 4 | 0035538 | ALBANIA Mobile > * 3 | 0035539 | ALBANIA > 5 | 0035568 | ALBANIA Mobile > 6 | 0035569 | ALBANIA Mobile > * 3 | 003557 | ALBANIA > 7 | 00213 | ALGERIA > 8 | 0021361 | ALGERIA Mobile > 9 | 0021362 | ALGERIA Mobile > 10 | 0021363 | ALGERIA Mobile > * 7 | 0021364 | ALGERIA > > The added rows make it unnecessary to use regex for the searches. You can > use just the >= operator like this: (using the example number you gave) > > SELECT id FROM destlist WHERE '00213618833' >= prefix ORDER BY prefix > LIMIT 1 > > Which would return id 7. As another example, a query of "00213654321" > would match the last row, which again has id 7. > > I'm too tired to figure out the exact algorithm for adding the rows, but > I'm pretty sure it can be automated... The basic idea is that when there's > a row with id A and prefix XXXX, and another row with id B and prefix > XXXXY, we add another row with id A and prefix XXXX(Y+1). > > -- > Heikki Linnakangas > EnterpriseDB http://www.enterprisedb.com > > > > -- > No virus found in this incoming message. > Checked by AVG Free Edition. > Version: 7.1.409 / Virus Database: 268.15.4/563 - Release Date: 12/2/2006 >
Hello.. I have never used tsearch2 , but at a first glance , i would not see any major improvement , because the main advantage of tsearch is the splitting in words of a phrase.. But here , i only have one word (no spaces). Regards Alex ----- Original Message ----- From: "Oleg Bartunov" <oleg@sai.msu.su> To: "Alexandru Coseru" <alexandru.coseru@totaltelecom.ro> Cc: "Dave Dutcher" <dave@tridecap.com>; <pgsql-performance@postgresql.org> Sent: Saturday, December 02, 2006 10:54 PM Subject: Re: [PERFORM] Regex performance issue >I may miss something but I'd use tsearch2. Check > intdict dictionary for basic idea - > http://www.sai.msu.su/~megera/wiki/Gendict > > Oleg > On Sat, 2 Dec 2006, Alexandru Coseru wrote: > >> Hello... >> >> I cannot use LIKE , because the order of the match is reversed. >> The prefix column is containing telephone destinations. >> IE: ^001 - US , ^0039 Italy , etc.. >> >> Here is a small sample: >> >> asterisk=> select * from destlist LIMIT 10; >> id | id_ent | dir | prefix | country | network | tip >> ----+--------+-----+------------+-------------+--------------------+----- >> 1 | -1 | 0 | (^0093) | AFGHANISTAN | AFGHANISTAN | 6 >> 2 | -1 | 0 | (^00937) | AFGHANISTAN | AFGHANISTAN Mobile | 5 >> 3 | -1 | 0 | (^00355) | ALBANIA | ALBANIA | 6 >> 4 | -1 | 0 | (^0035538) | ALBANIA | ALBANIA Mobile | 5 >> 5 | -1 | 0 | (^0035568) | ALBANIA | ALBANIA Mobile | 5 >> 6 | -1 | 0 | (^0035569) | ALBANIA | ALBANIA Mobile | 5 >> 7 | -1 | 0 | (^00213) | ALGERIA | ALGERIA | 6 >> 8 | -1 | 0 | (^0021361) | ALGERIA | ALGERIA Mobile | 5 >> 9 | -1 | 0 | (^0021362) | ALGERIA | ALGERIA Mobile | 5 >> 10 | -1 | 0 | (^0021363) | ALGERIA | ALGERIA Mobile | 5 >> >> >> Now , I have to match a dialednumber (let's say 00213618833) and find >> it's destination...(It's algeria mobile). >> I tried to make with a query of using LIKE , but i was not able to get >> something.. >> >> >> Regards >> Alex >> >> >> >> >> >> ----- Original Message ----- From: "Dave Dutcher" <dave@tridecap.com> >> To: "'Alexandru Coseru'" <alexandru.coseru@totaltelecom.ro>; >> <pgsql-performance@postgresql.org> >> Sent: Saturday, December 02, 2006 10:36 PM >> Subject: RE: [PERFORM] Regex performance issue >> >> >>> -----Original Message----- >>> From: pgsql-performance-owner@postgresql.org On Behalf Of Alexandru >>> Coseru >>> asterisk=> explain analyze SELECT * FROM destlist WHERE >>> '0039051248787' ~ >>> prefix AND id_ent='-2' AND dir=0 ORDER by length(prefix) DESC; >>> >>> >>> QUERY PLAN >>> -------------------------------------------------------------- >>> ---------------------------------------------------------------------- >>> Sort (cost=7925.07..7925.15 rows=31 width=67) (actual >>> time=857.715..857.716 rows=2 loops=1) >>> Sort Key: length((prefix)::text) >>> -> Bitmap Heap Scan on destlist (cost=60.16..7924.30 >>> rows=31 width=67) >>> (actual time=2.156..857.686 rows=2 loops=1) >>> Recheck Cond: ((id_ent = -2) AND (dir = 0)) >>> Filter: ('0039051248787'::text ~ (prefix)::text) >>> -> Bitmap Index Scan on destlist_indx2 (cost=0.00..60.16 >>> rows=6193 width=0) (actual time=1.961..1.961 rows=5205 loops=1) >>> Index Cond: ((id_ent = -2) AND (dir = 0)) >>> Total runtime: 857.804 ms >>> (8 rows) >>> >>> "mmumu" btree (prefix varchar_pattern_ops) >>> >> >> I'm surpised Postgres isn't using the index on prefix seeing as the index >> uses the varchar_pattern_ops operator class. It could be that the index >> isn't selective enough, or is Postgres not able to use an index with >> Posix >> regular expressions? The docs seem to say that it can, but I'd be >> curious >> to see what happens if you use LIKE instead of ~. >> >> Dave >> >> >> >> >> >> > > Regards, > Oleg > _____________________________________________________________ > Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), > Sternberg Astronomical Institute, Moscow University, Russia > Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ > phone: +007(495)939-16-83, +007(495)939-23-83 > > > > -- > No virus found in this incoming message. > Checked by AVG Free Edition. > Version: 7.1.409 / Virus Database: 268.15.4/563 - Release Date: 12/2/2006 >
On Sun, 2006-12-03 at 02:53 +0200, Alexandru Coseru wrote: > Hello.. > > Thanks for the tip , i think i have got the ideea.. > > I'm too tired too , and i will try it tommorow. > > > Anyway , anybody has a clue why this regex is that CPU intensive ? I did > not saw the light on my drives blinking , and also vmstat doesn't yeld any > blocks in or out... > And how can it be optimized ? I think it's just that you're running the regex across 5400 or so elements. at 850 or so milliseconds, that comes out to 150uS for each comparison. I'd say it's just the number of comparisons that have to be made that's killing you here.
"Alexandru Coseru" <alexandru.coseru@totaltelecom.ro> writes: > Anyway , anybody has a clue why this regex is that CPU intensive ? The EXPLAIN result you posted offers *no* evidence that the regexp is CPU intensive. All you know is that it took 850+ msec to fetch 5200 rows from disk and apply the regexp filter to them. There's no evidence here that that was CPU time and not I/O time. regards, tom lane
On Sun, 3 Dec 2006, Alexandru Coseru wrote: > Hello.. > > I have never used tsearch2 , but at a first glance , i would not see any > major improvement , because the main advantage of tsearch is the splitting in > words of a phrase.. > But here , i only have one word (no spaces). Oh, yes, I was confused :) What if you consider you prefix as 1.2.3.4.5.6, then you could try our contrib/ltree module. Oleg > > > Regards > Alex > ----- Original Message ----- From: "Oleg Bartunov" <oleg@sai.msu.su> > To: "Alexandru Coseru" <alexandru.coseru@totaltelecom.ro> > Cc: "Dave Dutcher" <dave@tridecap.com>; <pgsql-performance@postgresql.org> > Sent: Saturday, December 02, 2006 10:54 PM > Subject: Re: [PERFORM] Regex performance issue > > >> I may miss something but I'd use tsearch2. Check >> intdict dictionary for basic idea - >> http://www.sai.msu.su/~megera/wiki/Gendict >> >> Oleg >> On Sat, 2 Dec 2006, Alexandru Coseru wrote: >> >>> Hello... >>> >>> I cannot use LIKE , because the order of the match is reversed. >>> The prefix column is containing telephone destinations. >>> IE: ^001 - US , ^0039 Italy , etc.. >>> >>> Here is a small sample: >>> >>> asterisk=> select * from destlist LIMIT 10; >>> id | id_ent | dir | prefix | country | network | tip >>> ----+--------+-----+------------+-------------+--------------------+----- >>> 1 | -1 | 0 | (^0093) | AFGHANISTAN | AFGHANISTAN | 6 >>> 2 | -1 | 0 | (^00937) | AFGHANISTAN | AFGHANISTAN Mobile | 5 >>> 3 | -1 | 0 | (^00355) | ALBANIA | ALBANIA | 6 >>> 4 | -1 | 0 | (^0035538) | ALBANIA | ALBANIA Mobile | 5 >>> 5 | -1 | 0 | (^0035568) | ALBANIA | ALBANIA Mobile | 5 >>> 6 | -1 | 0 | (^0035569) | ALBANIA | ALBANIA Mobile | 5 >>> 7 | -1 | 0 | (^00213) | ALGERIA | ALGERIA | 6 >>> 8 | -1 | 0 | (^0021361) | ALGERIA | ALGERIA Mobile | 5 >>> 9 | -1 | 0 | (^0021362) | ALGERIA | ALGERIA Mobile | 5 >>> 10 | -1 | 0 | (^0021363) | ALGERIA | ALGERIA Mobile | 5 >>> >>> >>> Now , I have to match a dialednumber (let's say 00213618833) and find >>> it's destination...(It's algeria mobile). >>> I tried to make with a query of using LIKE , but i was not able to get >>> something.. >>> >>> >>> Regards >>> Alex >>> >>> >>> >>> >>> >>> ----- Original Message ----- From: "Dave Dutcher" <dave@tridecap.com> >>> To: "'Alexandru Coseru'" <alexandru.coseru@totaltelecom.ro>; >>> <pgsql-performance@postgresql.org> >>> Sent: Saturday, December 02, 2006 10:36 PM >>> Subject: RE: [PERFORM] Regex performance issue >>> >>> >>>> -----Original Message----- >>>> From: pgsql-performance-owner@postgresql.org On Behalf Of Alexandru >>>> Coseru >>>> asterisk=> explain analyze SELECT * FROM destlist WHERE >>>> '0039051248787' ~ >>>> prefix AND id_ent='-2' AND dir=0 ORDER by length(prefix) DESC; >>>> >>>> >>>> QUERY PLAN >>>> -------------------------------------------------------------- >>>> ---------------------------------------------------------------------- >>>> Sort (cost=7925.07..7925.15 rows=31 width=67) (actual >>>> time=857.715..857.716 rows=2 loops=1) >>>> Sort Key: length((prefix)::text) >>>> -> Bitmap Heap Scan on destlist (cost=60.16..7924.30 >>>> rows=31 width=67) >>>> (actual time=2.156..857.686 rows=2 loops=1) >>>> Recheck Cond: ((id_ent = -2) AND (dir = 0)) >>>> Filter: ('0039051248787'::text ~ (prefix)::text) >>>> -> Bitmap Index Scan on destlist_indx2 (cost=0.00..60.16 >>>> rows=6193 width=0) (actual time=1.961..1.961 rows=5205 loops=1) >>>> Index Cond: ((id_ent = -2) AND (dir = 0)) >>>> Total runtime: 857.804 ms >>>> (8 rows) >>>> >>>> "mmumu" btree (prefix varchar_pattern_ops) >>>> >>> >>> I'm surpised Postgres isn't using the index on prefix seeing as the index >>> uses the varchar_pattern_ops operator class. It could be that the index >>> isn't selective enough, or is Postgres not able to use an index with Posix >>> regular expressions? The docs seem to say that it can, but I'd be curious >>> to see what happens if you use LIKE instead of ~. >>> >>> Dave >>> >>> >>> >>> >>> >>> >> >> Regards, >> Oleg >> _____________________________________________________________ >> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), >> Sternberg Astronomical Institute, Moscow University, Russia >> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ >> phone: +007(495)939-16-83, +007(495)939-23-83 >> >> >> >> -- >> No virus found in this incoming message. >> Checked by AVG Free Edition. >> Version: 7.1.409 / Virus Database: 268.15.4/563 - Release Date: 12/2/2006 >> > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Hello.. I presumed CPU intensive because: 1) I have no hdd lights turn on during a series of queries (about 50 of them) 2) vmstat doesn't give me blocks in and just a couple of blocks out. 3) top reports between 95 and 100 user cpu . sometimes ,i can see some hi and si work (max of 2%). I'm pretty sure that the whole table is in cache.. The table size is about 50 Megs , and I have 1 Gb of RAM.. I have no other RAM eating procesess. Could it be related to CPU <-> RAM speed ? Regards Alex ----- Original Message ----- From: "Tom Lane" <tgl@sss.pgh.pa.us> To: "Alexandru Coseru" <alexandru.coseru@totaltelecom.ro> Cc: "Heikki Linnakangas" <heikki@enterprisedb.com>; <pgsql-performance@postgresql.org> Sent: Sunday, December 03, 2006 6:05 AM Subject: Re: [PERFORM] Regex performance issue > "Alexandru Coseru" <alexandru.coseru@totaltelecom.ro> writes: >> Anyway , anybody has a clue why this regex is that CPU intensive ? > > The EXPLAIN result you posted offers *no* evidence that the regexp is > CPU intensive. All you know is that it took 850+ msec to fetch 5200 > rows from disk and apply the regexp filter to them. There's no evidence > here that that was CPU time and not I/O time. > > regards, tom lane > > > > -- > No virus found in this incoming message. > Checked by AVG Free Edition. > Version: 7.1.409 / Virus Database: 268.15.4/563 - Release Date: 12/2/2006 > >