Thread: Regex performance issue

Regex performance issue

From

"Alexandru Coseru"

Date:

02 December 2006, 15:24:21

Hello..

I have a low performance problem with regexp.

Here are the details:

asterisk=> explain analyze SELECT * FROM destlist WHERE '0039051248787' ~
prefix AND id_ent='-2' AND dir=0 ORDER by length(prefix) DESC;
                                                             QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=7925.07..7925.15 rows=31 width=67) (actual
time=857.715..857.716 rows=2 loops=1)
   Sort Key: length((prefix)::text)
   ->  Bitmap Heap Scan on destlist  (cost=60.16..7924.30 rows=31 width=67)
(actual time=2.156..857.686 rows=2 loops=1)
         Recheck Cond: ((id_ent = -2) AND (dir = 0))
         Filter: ('0039051248787'::text ~ (prefix)::text)
         ->  Bitmap Index Scan on destlist_indx2  (cost=0.00..60.16
rows=6193 width=0) (actual time=1.961..1.961 rows=5205 loops=1)
               Index Cond: ((id_ent = -2) AND (dir = 0))
 Total runtime: 857.804 ms
(8 rows)


The main problem is the query time.
As you can see , the use of index destlist_indx2  is pretty quick  (1.9 ms)
, but the regexp operation takes a lot of time  (857 ms).

How can i improve that ?

Regards
    Alex


PS:  Additional info:

[root@voce1 billing]# uname -a
Linux voce1 2.6.11-1.1369_FC4smp #1 SMP Thu Jun 2 23:16:33 EDT 2005 x86_64
x86_64 x86_64 GNU/Linux

[root@voce1 billing]# free
             total       used       free     shared    buffers     cached
Mem:       1023912     977788      46124          0      80900     523868
-/+ buffers/cache:     373020     650892
Swap:      3172728       8488    3164240


Welcome to psql 8.1.2, the PostgreSQL interactive terminal.

asterisk=> \d destlist;
                                   Table "public.destlist"
 Column  |          Type          |                         Modifiers
---------+------------------------+-----------------------------------------------------------
 id      | bigint                 | not null default
nextval(('destlist_id'::text)::regclass)
 id_ent  | integer                |
 dir     | integer                |
 prefix  | character varying(255) |
 country | character varying(255) |
 network | character varying(255) |
 tip     | integer                |


Indexes:
    "destlist_unique" UNIQUE, btree (id_ent, dir, prefix)
    "destlist_indx2" btree (id_ent, dir)
    "destlist_indx3" btree (id_ent, dir, prefix)
    "mmumu" btree (prefix varchar_pattern_ops)


asterisk=> select count(*) from destlist;
 count
--------
 576424
(1 row)






[root@voce1 billing]# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      :                   Intel(R) Xeon(TM) CPU 3.00GHz
stepping        : 3
cpu MHz         : 2992.658
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm
constant_tsc pni monitor ds_cpl cid cx16 xtpr
bogomips        : 5914.62
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      :                   Intel(R) Xeon(TM) CPU 3.00GHz
stepping        : 3
cpu MHz         : 2992.658
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm
constant_tsc pni monitor ds_cpl cid cx16 xtpr
bogomips        : 5980.16
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:

Re: Regex performance issue

From

"Dave Dutcher"

Date:

02 December 2006, 16:36:57

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org On Behalf Of Alexandru Coseru
> asterisk=> explain analyze SELECT * FROM destlist WHERE
> '0039051248787' ~
> prefix AND id_ent='-2' AND dir=0 ORDER by length(prefix) DESC;
>
>
> QUERY PLAN
> --------------------------------------------------------------
> ----------------------------------------------------------------------
>  Sort  (cost=7925.07..7925.15 rows=31 width=67) (actual
> time=857.715..857.716 rows=2 loops=1)
>    Sort Key: length((prefix)::text)
>    ->  Bitmap Heap Scan on destlist  (cost=60.16..7924.30
> rows=31 width=67)
> (actual time=2.156..857.686 rows=2 loops=1)
>          Recheck Cond: ((id_ent = -2) AND (dir = 0))
>          Filter: ('0039051248787'::text ~ (prefix)::text)
>          ->  Bitmap Index Scan on destlist_indx2  (cost=0.00..60.16
> rows=6193 width=0) (actual time=1.961..1.961 rows=5205 loops=1)
>                Index Cond: ((id_ent = -2) AND (dir = 0))
>  Total runtime: 857.804 ms
> (8 rows)
>
>
>     "mmumu" btree (prefix varchar_pattern_ops)
>

I'm surpised Postgres isn't using the index on prefix seeing as the index
uses the varchar_pattern_ops operator class.  It could be that the index
isn't selective enough, or is Postgres not able to use an index with Posix
regular expressions?  The docs seem to say that it can, but I'd be curious
to see what happens if you use LIKE instead of ~.

Dave

Re: Regex performance issue

From

"Alexandru Coseru"

Date:

02 December 2006, 16:49:00

Hello...

I cannot use LIKE , because the order of the match is reversed.
The prefix column is containing telephone destinations.
IE:    ^001  - US  , ^0039 Italy , etc..

Here is a small sample:

asterisk=> select * from destlist LIMIT 10;
 id | id_ent | dir |   prefix   |   country   |      network       | tip
----+--------+-----+------------+-------------+--------------------+-----
  1 |     -1 |   0 | (^0093)    | AFGHANISTAN | AFGHANISTAN        |   6
  2 |     -1 |   0 | (^00937)   | AFGHANISTAN | AFGHANISTAN Mobile |   5
  3 |     -1 |   0 | (^00355)   | ALBANIA     | ALBANIA            |   6
  4 |     -1 |   0 | (^0035538) | ALBANIA     | ALBANIA Mobile     |   5
  5 |     -1 |   0 | (^0035568) | ALBANIA     | ALBANIA Mobile     |   5
  6 |     -1 |   0 | (^0035569) | ALBANIA     | ALBANIA Mobile     |   5
  7 |     -1 |   0 | (^00213)   | ALGERIA     | ALGERIA            |   6
  8 |     -1 |   0 | (^0021361) | ALGERIA     | ALGERIA Mobile     |   5
  9 |     -1 |   0 | (^0021362) | ALGERIA     | ALGERIA Mobile     |   5
 10 |     -1 |   0 | (^0021363) | ALGERIA     | ALGERIA Mobile     |   5


Now , I have to match a dialednumber   (let's say   00213618833) and find
it's destination...(It's algeria mobile).
I tried to make with a query of using LIKE , but i was not able to get
something..


Regards
    Alex





----- Original Message -----
From: "Dave Dutcher" <dave@tridecap.com>
To: "'Alexandru Coseru'" <alexandru.coseru@totaltelecom.ro>;
<pgsql-performance@postgresql.org>
Sent: Saturday, December 02, 2006 10:36 PM
Subject: RE: [PERFORM] Regex performance issue


> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org On Behalf Of Alexandru Coseru
> asterisk=> explain analyze SELECT * FROM destlist WHERE
> '0039051248787' ~
> prefix AND id_ent='-2' AND dir=0 ORDER by length(prefix) DESC;
>
>
> QUERY PLAN
> --------------------------------------------------------------
> ----------------------------------------------------------------------
>  Sort  (cost=7925.07..7925.15 rows=31 width=67) (actual
> time=857.715..857.716 rows=2 loops=1)
>    Sort Key: length((prefix)::text)
>    ->  Bitmap Heap Scan on destlist  (cost=60.16..7924.30
> rows=31 width=67)
> (actual time=2.156..857.686 rows=2 loops=1)
>          Recheck Cond: ((id_ent = -2) AND (dir = 0))
>          Filter: ('0039051248787'::text ~ (prefix)::text)
>          ->  Bitmap Index Scan on destlist_indx2  (cost=0.00..60.16
> rows=6193 width=0) (actual time=1.961..1.961 rows=5205 loops=1)
>                Index Cond: ((id_ent = -2) AND (dir = 0))
>  Total runtime: 857.804 ms
> (8 rows)
>
>
>     "mmumu" btree (prefix varchar_pattern_ops)
>

I'm surpised Postgres isn't using the index on prefix seeing as the index
uses the varchar_pattern_ops operator class.  It could be that the index
isn't selective enough, or is Postgres not able to use an index with Posix
regular expressions?  The docs seem to say that it can, but I'd be curious
to see what happens if you use LIKE instead of ~.

Dave





--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.409 / Virus Database: 268.15.4/563 - Release Date: 12/2/2006

Re: Regex performance issue

From

Oleg Bartunov

Date:

02 December 2006, 16:55:08

I may miss something but I'd use tsearch2. Check
intdict dictionary for basic idea -  http://www.sai.msu.su/~megera/wiki/Gendict

Oleg
On Sat, 2 Dec 2006, Alexandru Coseru wrote:

> Hello...
>
> I cannot use LIKE , because the order of the match is reversed.
> The prefix column is containing telephone destinations.
> IE:    ^001  - US  , ^0039 Italy , etc..
>
> Here is a small sample:
>
> asterisk=> select * from destlist LIMIT 10;
> id | id_ent | dir |   prefix   |   country   |      network       | tip
> ----+--------+-----+------------+-------------+--------------------+-----
> 1 |     -1 |   0 | (^0093)    | AFGHANISTAN | AFGHANISTAN        |   6
> 2 |     -1 |   0 | (^00937)   | AFGHANISTAN | AFGHANISTAN Mobile |   5
> 3 |     -1 |   0 | (^00355)   | ALBANIA     | ALBANIA            |   6
> 4 |     -1 |   0 | (^0035538) | ALBANIA     | ALBANIA Mobile     |   5
> 5 |     -1 |   0 | (^0035568) | ALBANIA     | ALBANIA Mobile     |   5
> 6 |     -1 |   0 | (^0035569) | ALBANIA     | ALBANIA Mobile     |   5
> 7 |     -1 |   0 | (^00213)   | ALGERIA     | ALGERIA            |   6
> 8 |     -1 |   0 | (^0021361) | ALGERIA     | ALGERIA Mobile     |   5
> 9 |     -1 |   0 | (^0021362) | ALGERIA     | ALGERIA Mobile     |   5
> 10 |     -1 |   0 | (^0021363) | ALGERIA     | ALGERIA Mobile     |   5
>
>
> Now , I have to match a dialednumber   (let's say   00213618833) and find
> it's destination...(It's algeria mobile).
> I tried to make with a query of using LIKE , but i was not able to get
> something..
>
>
> Regards
>   Alex
>
>
>
>
>
> ----- Original Message ----- From: "Dave Dutcher" <dave@tridecap.com>
> To: "'Alexandru Coseru'" <alexandru.coseru@totaltelecom.ro>;
> <pgsql-performance@postgresql.org>
> Sent: Saturday, December 02, 2006 10:36 PM
> Subject: RE: [PERFORM] Regex performance issue
>
>
>> -----Original Message-----
>> From: pgsql-performance-owner@postgresql.org On Behalf Of Alexandru Coseru
>> asterisk=> explain analyze SELECT * FROM destlist WHERE
>> '0039051248787' ~
>> prefix AND id_ent='-2' AND dir=0 ORDER by length(prefix) DESC;
>>
>>
>> QUERY PLAN
>> --------------------------------------------------------------
>> ----------------------------------------------------------------------
>>  Sort  (cost=7925.07..7925.15 rows=31 width=67) (actual
>> time=857.715..857.716 rows=2 loops=1)
>>    Sort Key: length((prefix)::text)
>>    ->  Bitmap Heap Scan on destlist  (cost=60.16..7924.30
>> rows=31 width=67)
>> (actual time=2.156..857.686 rows=2 loops=1)
>>          Recheck Cond: ((id_ent = -2) AND (dir = 0))
>>          Filter: ('0039051248787'::text ~ (prefix)::text)
>>          ->  Bitmap Index Scan on destlist_indx2  (cost=0.00..60.16
>> rows=6193 width=0) (actual time=1.961..1.961 rows=5205 loops=1)
>>                Index Cond: ((id_ent = -2) AND (dir = 0))
>>  Total runtime: 857.804 ms
>> (8 rows)
>>
>>
>>     "mmumu" btree (prefix varchar_pattern_ops)
>>
>
> I'm surpised Postgres isn't using the index on prefix seeing as the index
> uses the varchar_pattern_ops operator class.  It could be that the index
> isn't selective enough, or is Postgres not able to use an index with Posix
> regular expressions?  The docs seem to say that it can, but I'd be curious
> to see what happens if you use LIKE instead of ~.
>
> Dave
>
>
>
>
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Regex performance issue

From

"Heikki Linnakangas"

Date:

02 December 2006, 18:04:24

Alexandru Coseru wrote:
> I cannot use LIKE , because the order of the match is reversed.
> The prefix column is containing telephone destinations.
> IE:    ^001  - US  , ^0039 Italy , etc..

Maybe you could create a functional index on substr(<minimum length of
prefix>)? It might restrict the result set prior to applying the regex
just enough to make the performance acceptable.

> asterisk=> select * from destlist LIMIT 10;
> id | id_ent | dir |   prefix   |   country   |      network       | tip
> ----+--------+-----+------------+-------------+--------------------+-----
>  1 |     -1 |   0 | (^0093)    | AFGHANISTAN | AFGHANISTAN        |   6
>  2 |     -1 |   0 | (^00937)   | AFGHANISTAN | AFGHANISTAN Mobile |   5
>  3 |     -1 |   0 | (^00355)   | ALBANIA     | ALBANIA            |   6
>  4 |     -1 |   0 | (^0035538) | ALBANIA     | ALBANIA Mobile     |   5
>  5 |     -1 |   0 | (^0035568) | ALBANIA     | ALBANIA Mobile     |   5
>  6 |     -1 |   0 | (^0035569) | ALBANIA     | ALBANIA Mobile     |   5
>  7 |     -1 |   0 | (^00213)   | ALGERIA     | ALGERIA            |   6
>  8 |     -1 |   0 | (^0021361) | ALGERIA     | ALGERIA Mobile     |   5
>  9 |     -1 |   0 | (^0021362) | ALGERIA     | ALGERIA Mobile     |   5
> 10 |     -1 |   0 | (^0021363) | ALGERIA     | ALGERIA Mobile     |   5
>
> Now , I have to match a dialednumber   (let's say   00213618833) and find it's destination...(It's algeria mobile).
> I tried to make with a query of using LIKE , but i was not able to get something..

Another idea would be to add some extra rows so that you could use
normal inequality searches. For example, let's take the Albanian rows:

   3 |     -1 |   0 | 00355
   4 |     -1 |   0 | 0035538
* 3 |     -1 |   0 | 0035539
   5 |     -1 |   0 | 0035568
   6 |     -1 |   0 | 0035569
* 3 |     -1 |   0 | 0035570

Now you can do "SELECT * FROM destlist WHERE ? >= prefix ORDER BY prefix
LIMIT 1".

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Regex performance issue

From

"Alexandru Coseru"

Date:

02 December 2006, 18:13:41

Hello..

I cannot use the first advice , because i'm not aware of the prefix length
in the database...
This is why i'm ordering after length(prefix)..


On the 2nd one , i'm not sure that i can follow you..

Regards
    Alex
----- Original Message -----
From: "Heikki Linnakangas" <heikki@enterprisedb.com>
To: "Alexandru Coseru" <alexandru.coseru@totaltelecom.ro>
Cc: "Dave Dutcher" <dave@tridecap.com>; <pgsql-performance@postgresql.org>
Sent: Sunday, December 03, 2006 12:04 AM
Subject: Re: [PERFORM] Regex performance issue


> Alexandru Coseru wrote:
>> I cannot use LIKE , because the order of the match is reversed.
>> The prefix column is containing telephone destinations.
>> IE:    ^001  - US  , ^0039 Italy , etc..
>
> Maybe you could create a functional index on substr(<minimum length of
> prefix>)? It might restrict the result set prior to applying the regex
> just enough to make the performance acceptable.
>
>> asterisk=> select * from destlist LIMIT 10;
>> id | id_ent | dir |   prefix   |   country   |      network       | tip
>> ----+--------+-----+------------+-------------+--------------------+-----
>>  1 |     -1 |   0 | (^0093)    | AFGHANISTAN | AFGHANISTAN        |   6
>>  2 |     -1 |   0 | (^00937)   | AFGHANISTAN | AFGHANISTAN Mobile |   5
>>  3 |     -1 |   0 | (^00355)   | ALBANIA     | ALBANIA            |   6
>>  4 |     -1 |   0 | (^0035538) | ALBANIA     | ALBANIA Mobile     |   5
>>  5 |     -1 |   0 | (^0035568) | ALBANIA     | ALBANIA Mobile     |   5
>>  6 |     -1 |   0 | (^0035569) | ALBANIA     | ALBANIA Mobile     |   5
>>  7 |     -1 |   0 | (^00213)   | ALGERIA     | ALGERIA            |   6
>>  8 |     -1 |   0 | (^0021361) | ALGERIA     | ALGERIA Mobile     |   5
>>  9 |     -1 |   0 | (^0021362) | ALGERIA     | ALGERIA Mobile     |   5
>> 10 |     -1 |   0 | (^0021363) | ALGERIA     | ALGERIA Mobile     |   5
>>
>> Now , I have to match a dialednumber   (let's say   00213618833) and find
>> it's destination...(It's algeria mobile).
>> I tried to make with a query of using LIKE , but i was not able to get
>> something..
>
> Another idea would be to add some extra rows so that you could use normal
> inequality searches. For example, let's take the Albanian rows:
>
>   3 |     -1 |   0 | 00355
>   4 |     -1 |   0 | 0035538
> * 3 |     -1 |   0 | 0035539
>   5 |     -1 |   0 | 0035568
>   6 |     -1 |   0 | 0035569
> * 3 |     -1 |   0 | 0035570
>
> Now you can do "SELECT * FROM destlist WHERE ? >= prefix ORDER BY prefix
> LIMIT 1".
>
> --
>   Heikki Linnakangas
>   EnterpriseDB   http://www.enterprisedb.com
>
>
>
> --
> No virus found in this incoming message.
> Checked by AVG Free Edition.
> Version: 7.1.409 / Virus Database: 268.15.4/563 - Release Date: 12/2/2006
>
>

Re: Regex performance issue

From

"Heikki Linnakangas"

Date:

02 December 2006, 18:35:35

Alexandru Coseru wrote:
> Hello..
>
> I cannot use the first advice , because i'm not aware of the prefix
> length in the database...
> This is why i'm ordering after length(prefix)..
>
>
> On the 2nd one , i'm not sure that i can follow you..

Ok, let me try again :)

> asterisk=> select * from destlist LIMIT 10;
> id | id_ent | dir |   prefix   |   country   |      network       | tip
> ----+--------+-----+------------+-------------+--------------------+-----
>  1 |     -1 |   0 | (^0093)    | AFGHANISTAN | AFGHANISTAN        |   6
>  2 |     -1 |   0 | (^00937)   | AFGHANISTAN | AFGHANISTAN Mobile |   5
>  3 |     -1 |   0 | (^00355)   | ALBANIA     | ALBANIA            |   6
>  4 |     -1 |   0 | (^0035538) | ALBANIA     | ALBANIA Mobile     |   5
>  5 |     -1 |   0 | (^0035568) | ALBANIA     | ALBANIA Mobile     |   5
>  6 |     -1 |   0 | (^0035569) | ALBANIA     | ALBANIA Mobile     |   5
>  7 |     -1 |   0 | (^00213)   | ALGERIA     | ALGERIA            |   6
>  8 |     -1 |   0 | (^0021361) | ALGERIA     | ALGERIA Mobile     |   5
>  9 |     -1 |   0 | (^0021362) | ALGERIA     | ALGERIA Mobile     |   5
> 10 |     -1 |   0 | (^0021363) | ALGERIA     | ALGERIA Mobile     |   5

Store the prefix in a character column, without the regex stuff. Like
below. I've removed the columns that are not relevant, in fact it would
make sense to store them in another table, and have just the id and
prefix in this table.

id | prefix  | network
---+---------+--------------------
  1 | 0093    | AFGHANISTAN
  2 | 00937   | AFGHANISTAN Mobile
  3 | 00355   | ALBANIA
  4 | 0035538 | ALBANIA Mobile
  5 | 0035568 | ALBANIA Mobile
  6 | 0035569 | ALBANIA Mobile
  7 | 00213   | ALGERIA
  8 | 0021361 | ALGERIA Mobile
  9 | 0021362 | ALGERIA Mobile
10 | 0021363 | ALGERIA Mobile

Now, add the rows marked with start below:

  id | prefix  | network
----+---------+--------------------
   1 | 0093    | AFGHANISTAN
   2 | 00937   | AFGHANISTAN Mobile
* 1 | 00938   | AFGHANISTAN
   3 | 00355   | ALBANIA
   4 | 0035538 | ALBANIA Mobile
* 3 | 0035539 | ALBANIA
   5 | 0035568 | ALBANIA Mobile
   6 | 0035569 | ALBANIA Mobile
* 3 | 003557  | ALBANIA
   7 | 00213   | ALGERIA
   8 | 0021361 | ALGERIA Mobile
   9 | 0021362 | ALGERIA Mobile
  10 | 0021363 | ALGERIA Mobile
* 7 | 0021364 | ALGERIA

The added rows make it unnecessary to use regex for the searches. You
can use just the >= operator like this: (using the example number you gave)

SELECT id FROM destlist WHERE '00213618833' >= prefix ORDER BY prefix
LIMIT 1

Which would return id 7. As another example, a query of "00213654321"
would match the last row, which again has id 7.

I'm too tired to figure out the exact algorithm for adding the rows, but
I'm pretty sure it can be automated... The basic idea is that when
there's a row with id A and prefix XXXX, and another row with id B and
prefix XXXXY, we add another row with id A and prefix XXXX(Y+1).

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Regex performance issue

From

"Alexandru Coseru"

Date:

02 December 2006, 20:54:14

Hello..

Thanks for the tip , i think i have got the ideea..

I'm too tired too , and i will try it tommorow.


Anyway , anybody has a clue why this regex is that CPU intensive ?  I did
not saw the light on my drives blinking , and also vmstat doesn't yeld any
blocks in or out...
And how can it be optimized ?

Is there a way to trace the system calls ?
strace doesn't give me anything else but some lseeks and reads...


PS: Tried it with a 8.2 snaphsot and the result is the same..

Regards
    Alex
----- Original Message -----
From: "Heikki Linnakangas" <heikki@enterprisedb.com>
To: "Alexandru Coseru" <alexandru.coseru@totaltelecom.ro>
Cc: <pgsql-performance@postgresql.org>
Sent: Sunday, December 03, 2006 12:35 AM
Subject: Re: [PERFORM] Regex performance issue


> Alexandru Coseru wrote:
>> Hello..
>>
>> I cannot use the first advice , because i'm not aware of the prefix
>> length in the database...
>> This is why i'm ordering after length(prefix)..
>>
>>
>> On the 2nd one , i'm not sure that i can follow you..
>
> Ok, let me try again :)
>
>> asterisk=> select * from destlist LIMIT 10;
>> id | id_ent | dir |   prefix   |   country   |      network       | tip
>> ----+--------+-----+------------+-------------+--------------------+-----
>>  1 |     -1 |   0 | (^0093)    | AFGHANISTAN | AFGHANISTAN        |   6
>>  2 |     -1 |   0 | (^00937)   | AFGHANISTAN | AFGHANISTAN Mobile |   5
>>  3 |     -1 |   0 | (^00355)   | ALBANIA     | ALBANIA            |   6
>>  4 |     -1 |   0 | (^0035538) | ALBANIA     | ALBANIA Mobile     |   5
>>  5 |     -1 |   0 | (^0035568) | ALBANIA     | ALBANIA Mobile     |   5
>>  6 |     -1 |   0 | (^0035569) | ALBANIA     | ALBANIA Mobile     |   5
>>  7 |     -1 |   0 | (^00213)   | ALGERIA     | ALGERIA            |   6
>>  8 |     -1 |   0 | (^0021361) | ALGERIA     | ALGERIA Mobile     |   5
>>  9 |     -1 |   0 | (^0021362) | ALGERIA     | ALGERIA Mobile     |   5
>> 10 |     -1 |   0 | (^0021363) | ALGERIA     | ALGERIA Mobile     |   5
>
> Store the prefix in a character column, without the regex stuff. Like
> below. I've removed the columns that are not relevant, in fact it would
> make sense to store them in another table, and have just the id and prefix
> in this table.
>
> id | prefix  | network
> ---+---------+--------------------
>  1 | 0093    | AFGHANISTAN
>  2 | 00937   | AFGHANISTAN Mobile
>  3 | 00355   | ALBANIA
>  4 | 0035538 | ALBANIA Mobile
>  5 | 0035568 | ALBANIA Mobile
>  6 | 0035569 | ALBANIA Mobile
>  7 | 00213   | ALGERIA
>  8 | 0021361 | ALGERIA Mobile
>  9 | 0021362 | ALGERIA Mobile
> 10 | 0021363 | ALGERIA Mobile
>
> Now, add the rows marked with start below:
>
>  id | prefix  | network
> ----+---------+--------------------
>   1 | 0093    | AFGHANISTAN
>   2 | 00937   | AFGHANISTAN Mobile
> * 1 | 00938   | AFGHANISTAN
>   3 | 00355   | ALBANIA
>   4 | 0035538 | ALBANIA Mobile
> * 3 | 0035539 | ALBANIA
>   5 | 0035568 | ALBANIA Mobile
>   6 | 0035569 | ALBANIA Mobile
> * 3 | 003557  | ALBANIA
>   7 | 00213   | ALGERIA
>   8 | 0021361 | ALGERIA Mobile
>   9 | 0021362 | ALGERIA Mobile
>  10 | 0021363 | ALGERIA Mobile
> * 7 | 0021364 | ALGERIA
>
> The added rows make it unnecessary to use regex for the searches. You can
> use just the >= operator like this: (using the example number you gave)
>
> SELECT id FROM destlist WHERE '00213618833' >= prefix ORDER BY prefix
> LIMIT 1
>
> Which would return id 7. As another example, a query of "00213654321"
> would match the last row, which again has id 7.
>
> I'm too tired to figure out the exact algorithm for adding the rows, but
> I'm pretty sure it can be automated... The basic idea is that when there's
> a row with id A and prefix XXXX, and another row with id B and prefix
> XXXXY, we add another row with id A and prefix XXXX(Y+1).
>
> --
>   Heikki Linnakangas
>   EnterpriseDB   http://www.enterprisedb.com
>
>
>
> --
> No virus found in this incoming message.
> Checked by AVG Free Edition.
> Version: 7.1.409 / Virus Database: 268.15.4/563 - Release Date: 12/2/2006
>

Re: Regex performance issue

From

"Alexandru Coseru"

Date:

02 December 2006, 20:55:41

Hello..

I have never used tsearch2  , but at a first glance , i would not see any
major improvement , because the main advantage of tsearch is the splitting
in words of a phrase..
But here , i only have one word  (no spaces).


Regards
    Alex
----- Original Message -----
From: "Oleg Bartunov" <oleg@sai.msu.su>
To: "Alexandru Coseru" <alexandru.coseru@totaltelecom.ro>
Cc: "Dave Dutcher" <dave@tridecap.com>; <pgsql-performance@postgresql.org>
Sent: Saturday, December 02, 2006 10:54 PM
Subject: Re: [PERFORM] Regex performance issue


>I may miss something but I'd use tsearch2. Check
> intdict dictionary for basic idea -
> http://www.sai.msu.su/~megera/wiki/Gendict
>
> Oleg
> On Sat, 2 Dec 2006, Alexandru Coseru wrote:
>
>> Hello...
>>
>> I cannot use LIKE , because the order of the match is reversed.
>> The prefix column is containing telephone destinations.
>> IE:    ^001  - US  , ^0039 Italy , etc..
>>
>> Here is a small sample:
>>
>> asterisk=> select * from destlist LIMIT 10;
>> id | id_ent | dir |   prefix   |   country   |      network       | tip
>> ----+--------+-----+------------+-------------+--------------------+-----
>> 1 |     -1 |   0 | (^0093)    | AFGHANISTAN | AFGHANISTAN        |   6
>> 2 |     -1 |   0 | (^00937)   | AFGHANISTAN | AFGHANISTAN Mobile |   5
>> 3 |     -1 |   0 | (^00355)   | ALBANIA     | ALBANIA            |   6
>> 4 |     -1 |   0 | (^0035538) | ALBANIA     | ALBANIA Mobile     |   5
>> 5 |     -1 |   0 | (^0035568) | ALBANIA     | ALBANIA Mobile     |   5
>> 6 |     -1 |   0 | (^0035569) | ALBANIA     | ALBANIA Mobile     |   5
>> 7 |     -1 |   0 | (^00213)   | ALGERIA     | ALGERIA            |   6
>> 8 |     -1 |   0 | (^0021361) | ALGERIA     | ALGERIA Mobile     |   5
>> 9 |     -1 |   0 | (^0021362) | ALGERIA     | ALGERIA Mobile     |   5
>> 10 |     -1 |   0 | (^0021363) | ALGERIA     | ALGERIA Mobile     |   5
>>
>>
>> Now , I have to match a dialednumber   (let's say   00213618833) and find
>> it's destination...(It's algeria mobile).
>> I tried to make with a query of using LIKE , but i was not able to get
>> something..
>>
>>
>> Regards
>>   Alex
>>
>>
>>
>>
>>
>> ----- Original Message ----- From: "Dave Dutcher" <dave@tridecap.com>
>> To: "'Alexandru Coseru'" <alexandru.coseru@totaltelecom.ro>;
>> <pgsql-performance@postgresql.org>
>> Sent: Saturday, December 02, 2006 10:36 PM
>> Subject: RE: [PERFORM] Regex performance issue
>>
>>
>>> -----Original Message-----
>>> From: pgsql-performance-owner@postgresql.org On Behalf Of Alexandru
>>> Coseru
>>> asterisk=> explain analyze SELECT * FROM destlist WHERE
>>> '0039051248787' ~
>>> prefix AND id_ent='-2' AND dir=0 ORDER by length(prefix) DESC;
>>>
>>>
>>> QUERY PLAN
>>> --------------------------------------------------------------
>>> ----------------------------------------------------------------------
>>>  Sort  (cost=7925.07..7925.15 rows=31 width=67) (actual
>>> time=857.715..857.716 rows=2 loops=1)
>>>    Sort Key: length((prefix)::text)
>>>    ->  Bitmap Heap Scan on destlist  (cost=60.16..7924.30
>>> rows=31 width=67)
>>> (actual time=2.156..857.686 rows=2 loops=1)
>>>          Recheck Cond: ((id_ent = -2) AND (dir = 0))
>>>          Filter: ('0039051248787'::text ~ (prefix)::text)
>>>          ->  Bitmap Index Scan on destlist_indx2  (cost=0.00..60.16
>>> rows=6193 width=0) (actual time=1.961..1.961 rows=5205 loops=1)
>>>                Index Cond: ((id_ent = -2) AND (dir = 0))
>>>  Total runtime: 857.804 ms
>>> (8 rows)
>>>
>>>     "mmumu" btree (prefix varchar_pattern_ops)
>>>
>>
>> I'm surpised Postgres isn't using the index on prefix seeing as the index
>> uses the varchar_pattern_ops operator class.  It could be that the index
>> isn't selective enough, or is Postgres not able to use an index with
>> Posix
>> regular expressions?  The docs seem to say that it can, but I'd be
>> curious
>> to see what happens if you use LIKE instead of ~.
>>
>> Dave
>>
>>
>>
>>
>>
>>
>
>  Regards,
>  Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83
>
>
>
> --
> No virus found in this incoming message.
> Checked by AVG Free Edition.
> Version: 7.1.409 / Virus Database: 268.15.4/563 - Release Date: 12/2/2006
>

Re: Regex performance issue

From

Scott Marlowe

Date:

02 December 2006, 21:31:53

On Sun, 2006-12-03 at 02:53 +0200, Alexandru Coseru wrote:
> Hello..
>
> Thanks for the tip , i think i have got the ideea..
>
> I'm too tired too , and i will try it tommorow.
>
>
> Anyway , anybody has a clue why this regex is that CPU intensive ?  I did
> not saw the light on my drives blinking , and also vmstat doesn't yeld any
> blocks in or out...
> And how can it be optimized ?

I think it's just that you're running the regex across 5400 or so
elements.  at 850 or so milliseconds, that comes out to 150uS for each
comparison.  I'd say it's just the number of comparisons that have to be
made that's killing you here.

Re: Regex performance issue

From

Tom Lane

Date:

03 December 2006, 00:05:21

"Alexandru Coseru" <alexandru.coseru@totaltelecom.ro> writes:
> Anyway , anybody has a clue why this regex is that CPU intensive ?

The EXPLAIN result you posted offers *no* evidence that the regexp is
CPU intensive.  All you know is that it took 850+ msec to fetch 5200
rows from disk and apply the regexp filter to them.  There's no evidence
here that that was CPU time and not I/O time.

            regards, tom lane

Re: Regex performance issue

From

Oleg Bartunov

Date:

03 December 2006, 03:05:33

On Sun, 3 Dec 2006, Alexandru Coseru wrote:

> Hello..
>
> I have never used tsearch2  , but at a first glance , i would not see any
> major improvement , because the main advantage of tsearch is the splitting in
> words of a phrase..
> But here , i only have one word  (no spaces).

Oh, yes, I was confused :) What if you consider you prefix as
1.2.3.4.5.6, then you could try our contrib/ltree module.


Oleg


>
>
> Regards
>   Alex
> ----- Original Message ----- From: "Oleg Bartunov" <oleg@sai.msu.su>
> To: "Alexandru Coseru" <alexandru.coseru@totaltelecom.ro>
> Cc: "Dave Dutcher" <dave@tridecap.com>; <pgsql-performance@postgresql.org>
> Sent: Saturday, December 02, 2006 10:54 PM
> Subject: Re: [PERFORM] Regex performance issue
>
>
>> I may miss something but I'd use tsearch2. Check
>> intdict dictionary for basic idea -
>> http://www.sai.msu.su/~megera/wiki/Gendict
>>
>> Oleg
>> On Sat, 2 Dec 2006, Alexandru Coseru wrote:
>>
>>> Hello...
>>>
>>> I cannot use LIKE , because the order of the match is reversed.
>>> The prefix column is containing telephone destinations.
>>> IE:    ^001  - US  , ^0039 Italy , etc..
>>>
>>> Here is a small sample:
>>>
>>> asterisk=> select * from destlist LIMIT 10;
>>> id | id_ent | dir |   prefix   |   country   |      network       | tip
>>> ----+--------+-----+------------+-------------+--------------------+-----
>>> 1 |     -1 |   0 | (^0093)    | AFGHANISTAN | AFGHANISTAN        |   6
>>> 2 |     -1 |   0 | (^00937)   | AFGHANISTAN | AFGHANISTAN Mobile |   5
>>> 3 |     -1 |   0 | (^00355)   | ALBANIA     | ALBANIA            |   6
>>> 4 |     -1 |   0 | (^0035538) | ALBANIA     | ALBANIA Mobile     |   5
>>> 5 |     -1 |   0 | (^0035568) | ALBANIA     | ALBANIA Mobile     |   5
>>> 6 |     -1 |   0 | (^0035569) | ALBANIA     | ALBANIA Mobile     |   5
>>> 7 |     -1 |   0 | (^00213)   | ALGERIA     | ALGERIA            |   6
>>> 8 |     -1 |   0 | (^0021361) | ALGERIA     | ALGERIA Mobile     |   5
>>> 9 |     -1 |   0 | (^0021362) | ALGERIA     | ALGERIA Mobile     |   5
>>> 10 |     -1 |   0 | (^0021363) | ALGERIA     | ALGERIA Mobile     |   5
>>>
>>>
>>> Now , I have to match a dialednumber   (let's say   00213618833) and find
>>> it's destination...(It's algeria mobile).
>>> I tried to make with a query of using LIKE , but i was not able to get
>>> something..
>>>
>>>
>>> Regards
>>>   Alex
>>>
>>>
>>>
>>>
>>>
>>> ----- Original Message ----- From: "Dave Dutcher" <dave@tridecap.com>
>>> To: "'Alexandru Coseru'" <alexandru.coseru@totaltelecom.ro>;
>>> <pgsql-performance@postgresql.org>
>>> Sent: Saturday, December 02, 2006 10:36 PM
>>> Subject: RE: [PERFORM] Regex performance issue
>>>
>>>
>>>> -----Original Message-----
>>>> From: pgsql-performance-owner@postgresql.org On Behalf Of Alexandru
>>>> Coseru
>>>> asterisk=> explain analyze SELECT * FROM destlist WHERE
>>>> '0039051248787' ~
>>>> prefix AND id_ent='-2' AND dir=0 ORDER by length(prefix) DESC;
>>>>
>>>>
>>>> QUERY PLAN
>>>> --------------------------------------------------------------
>>>> ----------------------------------------------------------------------
>>>>  Sort  (cost=7925.07..7925.15 rows=31 width=67) (actual
>>>> time=857.715..857.716 rows=2 loops=1)
>>>>    Sort Key: length((prefix)::text)
>>>>    ->  Bitmap Heap Scan on destlist  (cost=60.16..7924.30
>>>> rows=31 width=67)
>>>> (actual time=2.156..857.686 rows=2 loops=1)
>>>>          Recheck Cond: ((id_ent = -2) AND (dir = 0))
>>>>          Filter: ('0039051248787'::text ~ (prefix)::text)
>>>>          ->  Bitmap Index Scan on destlist_indx2  (cost=0.00..60.16
>>>> rows=6193 width=0) (actual time=1.961..1.961 rows=5205 loops=1)
>>>>                Index Cond: ((id_ent = -2) AND (dir = 0))
>>>>  Total runtime: 857.804 ms
>>>> (8 rows)
>>>>
>>>>     "mmumu" btree (prefix varchar_pattern_ops)
>>>>
>>>
>>> I'm surpised Postgres isn't using the index on prefix seeing as the index
>>> uses the varchar_pattern_ops operator class.  It could be that the index
>>> isn't selective enough, or is Postgres not able to use an index with Posix
>>> regular expressions?  The docs seem to say that it can, but I'd be curious
>>> to see what happens if you use LIKE instead of ~.
>>>
>>> Dave
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>  Regards,
>>  Oleg
>> _____________________________________________________________
>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
>> Sternberg Astronomical Institute, Moscow University, Russia
>> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
>> phone: +007(495)939-16-83, +007(495)939-23-83
>>
>>
>>
>> --
>> No virus found in this incoming message.
>> Checked by AVG Free Edition.
>> Version: 7.1.409 / Virus Database: 268.15.4/563 - Release Date: 12/2/2006
>>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Regex performance issue

From

"Alexandru Coseru"

Date:

03 December 2006, 07:11:44

Hello..

I presumed CPU intensive  because:
    1) I have no hdd lights turn on during a series of queries (about 50 of
them)
    2) vmstat doesn't give me blocks in and just a couple of blocks out.
    3) top  reports between 95 and 100 user cpu    .    sometimes ,i can see
some  hi and si work  (max of 2%).


I'm pretty sure that the whole table is in cache..
The table size is about 50 Megs  , and I have 1 Gb of RAM..
I have no other RAM eating procesess.

Could it be related to  CPU <-> RAM speed ?

Regards
    Alex

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Alexandru Coseru" <alexandru.coseru@totaltelecom.ro>
Cc: "Heikki Linnakangas" <heikki@enterprisedb.com>;
<pgsql-performance@postgresql.org>
Sent: Sunday, December 03, 2006 6:05 AM
Subject: Re: [PERFORM] Regex performance issue


> "Alexandru Coseru" <alexandru.coseru@totaltelecom.ro> writes:
>> Anyway , anybody has a clue why this regex is that CPU intensive ?
>
> The EXPLAIN result you posted offers *no* evidence that the regexp is
> CPU intensive.  All you know is that it took 850+ msec to fetch 5200
> rows from disk and apply the regexp filter to them.  There's no evidence
> here that that was CPU time and not I/O time.
>
> regards, tom lane
>
>
>
> --
> No virus found in this incoming message.
> Checked by AVG Free Edition.
> Version: 7.1.409 / Virus Database: 268.15.4/563 - Release Date: 12/2/2006
>
>