Thread: using regular expression queries with regular expression indexes
Good afternoon,
--
We are currently running postgres version 12.2. I have a query that uses regexp code to confirm if a text field is a date or not. It is resource intensive and long running. I have made a regexp index (actually several flavors of regexp indexes) to aid in this queries performance. However I cannot seem to 'convince' the planner to use the indexes I create.
Following is table/index construction for your reference:
\d customertab
Table "public.customertab"
Column | Type | Collation | Nullable | Default
---------------------------+-----------------------------+-----------+----------+-------------------------------------------------
id | bigint | | not null | nextval('customer_attributes_id_seq'::regclass)
customer_attribute_key_id | integer | | |
attributable_id | integer | | |
attributable_type | character varying(255) | | |
value | character varying(255) | | |
created_at | timestamp without time zone | | |
updated_at | timestamp without time zone | | |
group_num | character varying(255) | | |
customer_transaction_id | integer | | |
account_id | integer | | |
Indexes:
"primets_custattr_pkey" PRIMARY KEY, btree (id)
"custattr_account_id" btree (account_id)
"custattr_value" btree (value) WHERE value::text ~ '^([0-9][0-9][0-9][0-9])-(([0]?[1-9])|([1][0-2]))-(([0-2]?[0-9])|([3][0-1]))$'::text AND value::text ~ '([0-9][0-9][0-9][0-9])-(0[13578]|1[02])-(0[1-9]|[12][0-9]|3[01])|([0-9][0-9][0-9][0-9])-(0[469]|11])-(0[1-9]|[12][0-9]|30)|([0-9][0-9][0248][048]|[0-9][0-9][13579][26])-(02)-(0[1-9]|1[0-9]|2[0-9])|([0-9][0-9][0248][1235679]|[0-9][0-9][13579][01345789])-(02)-(0[1-9]|1[0-9]|2[0-8])'::text AND account_id IS NOT NULL
"custattr_value_acctid_keyid" btree (value, account_id, customer_attribute_key_id) WHERE value::text ~ '^([0-9][0-9][0-9][0-9])-(([0]?[1-9])|([1][0-2]))-(([0-2]?[0-9])|([3][0-1]))$'::text AND value::text ~ '([0-9][0-9][0-9][0-9])-(0[13578]|1[02])-(0[1-9]|[12][0-9]|3[01])|([0-9][0-9][0-9][0-9])-(0[469]|11])-(0[1-9]|[12][0-9]|30)|([0-9][0-9][0248][048]|[0-9][0-9][13579][26])-(02)-(0[1-9]|1[0-9]|2[0-9])|([0-9][0-9][0248][1235679]|[0-9][0-9][13579][01345789])-(02)-(0[1-9]|1[0-9]|2[0-8])'::text AND account_id IS NOT NULL
Table "public.customertab"
Column | Type | Collation | Nullable | Default
---------------------------+-----------------------------+-----------+----------+-------------------------------------------------
id | bigint | | not null | nextval('customer_attributes_id_seq'::regclass)
customer_attribute_key_id | integer | | |
attributable_id | integer | | |
attributable_type | character varying(255) | | |
value | character varying(255) | | |
created_at | timestamp without time zone | | |
updated_at | timestamp without time zone | | |
group_num | character varying(255) | | |
customer_transaction_id | integer | | |
account_id | integer | | |
Indexes:
"primets_custattr_pkey" PRIMARY KEY, btree (id)
"custattr_account_id" btree (account_id)
"custattr_value" btree (value) WHERE value::text ~ '^([0-9][0-9][0-9][0-9])-(([0]?[1-9])|([1][0-2]))-(([0-2]?[0-9])|([3][0-1]))$'::text AND value::text ~ '([0-9][0-9][0-9][0-9])-(0[13578]|1[02])-(0[1-9]|[12][0-9]|3[01])|([0-9][0-9][0-9][0-9])-(0[469]|11])-(0[1-9]|[12][0-9]|30)|([0-9][0-9][0248][048]|[0-9][0-9][13579][26])-(02)-(0[1-9]|1[0-9]|2[0-9])|([0-9][0-9][0248][1235679]|[0-9][0-9][13579][01345789])-(02)-(0[1-9]|1[0-9]|2[0-8])'::text AND account_id IS NOT NULL
"custattr_value_acctid_keyid" btree (value, account_id, customer_attribute_key_id) WHERE value::text ~ '^([0-9][0-9][0-9][0-9])-(([0]?[1-9])|([1][0-2]))-(([0-2]?[0-9])|([3][0-1]))$'::text AND value::text ~ '([0-9][0-9][0-9][0-9])-(0[13578]|1[02])-(0[1-9]|[12][0-9]|3[01])|([0-9][0-9][0-9][0-9])-(0[469]|11])-(0[1-9]|[12][0-9]|30)|([0-9][0-9][0248][048]|[0-9][0-9][13579][26])-(02)-(0[1-9]|1[0-9]|2[0-9])|([0-9][0-9][0248][1235679]|[0-9][0-9][13579][01345789])-(02)-(0[1-9]|1[0-9]|2[0-8])'::text AND account_id IS NOT NULL
"custattr_value_includes" btree (value) INCLUDE (id, account_id, group_num) WHERE value::text ~ '^([0-9][0-9][0-9][0-9])-(([0]?[1-9])|([1][0-2]))-(([0-2]?[0-9])|([3][0-1]))$'::text AND value::text ~ '([0-9][0-9][0-9][0-9])-(0[13578]|1[02])-(0[1-9]|[12][0-9]|3[01])|([0-9][0-9][0-9][0-9])-(0[469]|11])-(0[1-9]|[12][0-9]|30)|([0-9][0-9][0248][048]|[0-9][0-9][13579][26])-(02)-(0[1-9]|1[0-9]|2[0-9])|([0-9][0-9][0248][1235679]|[0-9][0-9][13579][01345789])-(02)-(0[1-9]|1[0-9]|2[0-8])'::text AND account_id IS NOT NULL
The query I am attempting to run:
explain select * FROM customertab ca
where ca.value::text ~ '^([0-9][0-9][0-9][0-9])-(([0]?[1-9])|([1][0-2]))-(([0-2]?[0-9])|([3][0-1]))$'::text
AND ca.value::text ~ '([0-9][0-9][0-9][0-9])-(0[13578]|1[02])-(0[1-9]|[12][0-9]|3[01])|([0-9][0-9][0-9][0-9])-(0[469]|11])-(0[1-9]|[12][0-9]|30)|([0-9][0-9][0248][048]|[0-9][0-9][13579][26])-(02)-(0[1-9]|1[0-9]|2[0-9])|([0-9][0-9][0248][1235679]|[0-9][0-9][13579][01345789])-(02)-(0[1-9]|1[0-9]|2[0-8])'::text
AND ca.account_id IS NOT NULL
where ca.value::text ~ '^([0-9][0-9][0-9][0-9])-(([0]?[1-9])|([1][0-2]))-(([0-2]?[0-9])|([3][0-1]))$'::text
AND ca.value::text ~ '([0-9][0-9][0-9][0-9])-(0[13578]|1[02])-(0[1-9]|[12][0-9]|3[01])|([0-9][0-9][0-9][0-9])-(0[469]|11])-(0[1-9]|[12][0-9]|30)|([0-9][0-9][0248][048]|[0-9][0-9][13579][26])-(02)-(0[1-9]|1[0-9]|2[0-9])|([0-9][0-9][0248][1235679]|[0-9][0-9][13579][01345789])-(02)-(0[1-9]|1[0-9]|2[0-8])'::text
AND ca.account_id IS NOT NULL
and the explain:
Bitmap Heap Scan on customertab ca (cost=9907.40..903793.74 rows=479041 width=95)
Recheck Cond: (((value)::text ~ '^([0-9][0-9][0-9][0-9])-(([0]?[1-9])|([1][0-2]))-(([0-2]?[0-9])|([3][0-1]))$'::text) AND ((value)::text ~ '([0-9][0-9][0-9][0-9])-(0[13578]|1[02])-(0[1-9]|[12][0-9]|3[01])|([0-9][0-9][0
-9][0-9])-(0[469]|11])-(0[1-9]|[12][0-9]|30)|([0-9][0-9][0248][048]|[0-9][0-9][13579][26])-(02)-(0[1-9]|1[0-9]|2[0-9])|([0-9][0-9][0248][1235679]|[0-9][0-9][13579][01345789])-(02)-(0[1-9]|1[0-9]|2[0-8])'::text) AND (accou
nt_id IS NOT NULL))
-> Bitmap Index Scan on custattr_value (cost=0.00..9787.64 rows=479041 width=0)
(3 rows)
Recheck Cond: (((value)::text ~ '^([0-9][0-9][0-9][0-9])-(([0]?[1-9])|([1][0-2]))-(([0-2]?[0-9])|([3][0-1]))$'::text) AND ((value)::text ~ '([0-9][0-9][0-9][0-9])-(0[13578]|1[02])-(0[1-9]|[12][0-9]|3[01])|([0-9][0-9][0
-9][0-9])-(0[469]|11])-(0[1-9]|[12][0-9]|30)|([0-9][0-9][0248][048]|[0-9][0-9][13579][26])-(02)-(0[1-9]|1[0-9]|2[0-9])|([0-9][0-9][0248][1235679]|[0-9][0-9][13579][01345789])-(02)-(0[1-9]|1[0-9]|2[0-8])'::text) AND (accou
nt_id IS NOT NULL))
-> Bitmap Index Scan on custattr_value (cost=0.00..9787.64 rows=479041 width=0)
(3 rows)
Finally if I remove the 'ca.account_id is not null' expression
explain select * FROM customertab ca
where ca.value::text ~ '^([0-9][0-9][0-9][0-9])-(([0]?[1-9])|([1][0-2]))-(([0-2]?[0-9])|([3][0-1]))$'::text
AND ca.value::text ~ '([0-9][0-9][0-9][0-9])-(0[13578]|1[02])-(0[1-9]|[12][0-9]|3[01])|([0-9][0-9][0-9][0-9])-(0[469]|11])-(0[1-9]|[12][0-9]|30)|([0-9][0-9][0248][048]|[0-9][0-9][13579][26])-(02)-(0[1-9]|1[0-9]|2[0-9])|([0-9][0-9][0248][1235679]|[0-9][0-9][13579][01345789])-(02)-(0[1-9]|1[0-9]|2[0-8])'::text
where ca.value::text ~ '^([0-9][0-9][0-9][0-9])-(([0]?[1-9])|([1][0-2]))-(([0-2]?[0-9])|([3][0-1]))$'::text
AND ca.value::text ~ '([0-9][0-9][0-9][0-9])-(0[13578]|1[02])-(0[1-9]|[12][0-9]|3[01])|([0-9][0-9][0-9][0-9])-(0[469]|11])-(0[1-9]|[12][0-9]|30)|([0-9][0-9][0248][048]|[0-9][0-9][13579][26])-(02)-(0[1-9]|1[0-9]|2[0-9])|([0-9][0-9][0248][1235679]|[0-9][0-9][13579][01345789])-(02)-(0[1-9]|1[0-9]|2[0-8])'::text
This is the explain:
Gather (cost=1000.00..1641792.30 rows=479041 width=95)
Workers Planned: 2
-> Parallel Seq Scan on customertab ca (cost=0.00..1592888.20 rows=199600 width=95)
Filter: (((value)::text ~ '^([0-9][0-9][0-9][0-9])-(([0]?[1-9])|([1][0-2]))-(([0-2]?[0-9])|([3][0-1]))$'::text) AND ((value)::text ~ '([0-9][0-9][0-9][0-9])-(0[13578]|1[02])-(0[1-9]|[12][0-9]|3[01])|([0-9][0-9][0
-9][0-9])-(0[469]|11])-(0[1-9]|[12][0-9]|30)|([0-9][0-9][0248][048]|[0-9][0-9][13579][26])-(02)-(0[1-9]|1[0-9]|2[0-9])|([0-9][0-9][0248][1235679]|[0-9][0-9][13579][01345789])-(02)-(0[1-9]|1[0-9]|2[0-8])'::text))
(4 rows)
Workers Planned: 2
-> Parallel Seq Scan on customertab ca (cost=0.00..1592888.20 rows=199600 width=95)
Filter: (((value)::text ~ '^([0-9][0-9][0-9][0-9])-(([0]?[1-9])|([1][0-2]))-(([0-2]?[0-9])|([3][0-1]))$'::text) AND ((value)::text ~ '([0-9][0-9][0-9][0-9])-(0[13578]|1[02])-(0[1-9]|[12][0-9]|3[01])|([0-9][0-9][0
-9][0-9])-(0[469]|11])-(0[1-9]|[12][0-9]|30)|([0-9][0-9][0248][048]|[0-9][0-9][13579][26])-(02)-(0[1-9]|1[0-9]|2[0-9])|([0-9][0-9][0248][1235679]|[0-9][0-9][13579][01345789])-(02)-(0[1-9]|1[0-9]|2[0-8])'::text))
(4 rows)
Is there something else I can try, or am I missing something? Comments welcome.
Mark Steben
Database Administrator
@utoRevenue | Autobase
CRM division of Dominion Dealer Solutions
95D Ashley Ave.
West Springfield, MA 01089
t: 413.327-3045
f: 413.383-9567
Database Administrator
@utoRevenue | Autobase
CRM division of Dominion Dealer Solutions
95D Ashley Ave.
West Springfield, MA 01089
t: 413.327-3045
f: 413.383-9567
www.fb.com/DominionDealerSolutions
www.twitter.com/DominionDealer
www.drivedominion.com
Mark Steben <mark.steben@drivedominion.com> writes: > We are currently running postgres version 12.2. I have a query that uses > regexp code to confirm if a text field is a date or not. It is resource > intensive and long running. I have made a regexp index (actually several > flavors of regexp indexes) to aid in this queries performance. However I > cannot seem to 'convince' the planner to use the indexes I create. I don't understand your complaint. Your example shows that it *is* using the index: > -> Bitmap Index Scan on custattr_value (cost=0.00..9787.64 rows=479041 width=0) > Finally if I remove the 'ca.account_id is not null' expression > [ it doesn't use the index ] Well, of course not, because then the query doesn't match the index's WHERE condition (i.e., there could be rows the query needs to return that are not included in the index). But I imagine that restriction isn't very selective, so you could probably omit it from the index definition as well; then the index would work for queries with or without that restriction. regards, tom lane
Mark Steben schrieb am 25.06.2020 um 18:45: > I have a query that uses regexp code to confirm if a text field is a date or not. Why don't you define it as a date column? (e.g. so that leap years are corretly dealt with) But if you can't do that, what about a generated column that uses that regex to populate a boolean flag, e.g. is_valid_date? Then the expensive regex evaluation would only be done when you change the value, not everytime you query it. Thomas