Thread: Problem with index using regular expression

Problem with index using regular expression

From
Håvard Wahl Kongsgård
Date:
Hi, with psycopg 2-2.4.1 I have some issue with incorrect use of indexes using regular expression.
I my case I have create on index on the surname of users. 
CREATE INDEX  income_surname ON income(substring(income.name from E'\\w.*\\s(\\w.*)$|\\w+\\s(\\w.*)$'))

then used
SELECT Distinct On (nodes.id) nodes.ID from nodes inner join income on substring(nodes.name from E'\\w.*\\s(\\w.*)$|\\w+\\s(\\w.*)$') = substring(income.name from E'\\w.*\\s(\\w.*)$|\\w+\\s(\\w.*)$')

with psql the index works fine, but with psycopg I don't think it's used at all. The index substring has to be identical with the substring in the query. Are there some encoding issues with psycopg?

--
Håvard Wahl Kongsgård

http://havard.security-review.net/

Re: Problem with index using regular expression

From
Federico Di Gregorio
Date:
On 06/06/11 08:32, Håvard Wahl Kongsgård wrote:
> Hi, with psycopg 2-2.4.1 I have some issue with incorrect use of indexes
> using regular expression.
> I my case I have create on index on the surname of users.
> CREATE INDEX  income_surname ON income(substring(income.name
> <http://income.name> from E'\\w.*\\s(\\w.*)$|\\w+\\s(\\w.*)$'))
>
> then used
> SELECT Distinct On (nodes.id <http://nodes.id>) nodes.ID from nodes
> inner join income on substring(nodes.name <http://nodes.name> from
> E'\\w.*\\s(\\w.*)$|\\w+\\s(\\w.*)$') = substring(income.name
> <http://income.name> from E'\\w.*\\s(\\w.*)$|\\w+\\s(\\w.*)$')
>
> with psql the index works fine, but with psycopg I don't think it's used
> at all. The index substring has to be identical with the substring in
> the query. Are there some encoding issues with psycopg?

psycopg does encode your strings to the backend encoding but this
shouldn't be a problem. You can:

1) Send us the code so we can analyze it.
2) Enable PostgreSQL query logging and check the query sent by psycopg.
3) I feel you're doing something wrong with the '\' in the regular
   expression. Can you should us the Python of the execute() call?

federico

--
Federico Di Gregorio                         federico.digregorio@dndg.it
Studio Associato Di Nunzio e Di Gregorio                  http://dndg.it
                           There's no certainty - only opportunity. -- V

Re: Problem with index using regular expression

From
Håvard Wahl Kongsgård
Date:
conn = psycopg2.connect(database="tax_analytical", user="postgres", password="xxxx", host='xxxx')

cur = conn.cursor('nodes_database')

SQL = ("SELECT Distinct On (nodes.id) nodes.ID from nodes inner join income on substring(nodes.name from E'\\w.*\\s(\\w.*)$|\\w+\\s(\\w.*)$') = substring(income.name from E'\\w.*\\s(\\w.*)$|\\w+\\s(\\w.*)$') limit 1000")

cur.execute(SQL)

if __name__ == '__main__':
   pool = multiprocessing.Pool(processes=5)
   pool.map(network_file.main, cur, 1) # Ensure the chunk size is 1
   pool.close()
   pool.join()

cur.close()
conn.close()

-Håvard Wahl Kongsgård

On Mon, Jun 6, 2011 at 8:58 AM, Federico Di Gregorio <federico.digregorio@dndg.it> wrote:
On 06/06/11 08:32, Håvard Wahl Kongsgård wrote:
> Hi, with psycopg 2-2.4.1 I have some issue with incorrect use of indexes
> using regular expression.
> I my case I have create on index on the surname of users.
> CREATE INDEX  income_surname ON income(substring(income.name
> <http://income.name> from E'\\w.*\\s(\\w.*)$|\\w+\\s(\\w.*)$'))
>
> then used
> SELECT Distinct On (nodes.id <http://nodes.id>) nodes.ID from nodes
> inner join income on substring(nodes.name <http://nodes.name> from
> E'\\w.*\\s(\\w.*)$|\\w+\\s(\\w.*)$') = substring(income.name
> <http://income.name> from E'\\w.*\\s(\\w.*)$|\\w+\\s(\\w.*)$')
>
> with psql the index works fine, but with psycopg I don't think it's used
> at all. The index substring has to be identical with the substring in
> the query. Are there some encoding issues with psycopg?

psycopg does encode your strings to the backend encoding but this
shouldn't be a problem. You can:

1) Send us the code so we can analyze it.
2) Enable PostgreSQL query logging and check the query sent by psycopg.
3) I feel you're doing something wrong with the '\' in the regular
  expression. Can you should us the Python of the execute() call?

federico

--
Federico Di Gregorio                         federico.digregorio@dndg.it
Studio Associato Di Nunzio e Di Gregorio                  http://dndg.it
                          There's no certainty - only opportunity. -- V

--
Sent via psycopg mailing list (psycopg@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/psycopg



Re: Problem with index using regular expression

From
Federico Di Gregorio
Date:
On 06/06/11 09:03, Håvard Wahl Kongsgård wrote:
> substring(nodes.name <http://nodes.name> from
> E'\\w.*\\s(\\w.*)$|\\w+\\s(\\w.*)$') = substring(income.name
> <http://income.name> from E'\\w.*\\s(\\w.*)$|\\w+\\s(\\w.*)$') limit 1000")

Try doubling every '\' because that's a Python escape character too and
you want to send to the database the "\\X" sequence, not "\X"; i.e.,

... from E'\\\\w.*\\\\s(\\\\w.*)$|\\\\w+\\\\s(\\\\w.*)$') ...

Hope this helps,
federico

--
Federico Di Gregorio                         federico.digregorio@dndg.it
Studio Associato Di Nunzio e Di Gregorio                  http://dndg.it
  Qu'est ce que la folie? Juste un sentiment de liberté si
   fort qu'on en oublie ce qui nous rattache au monde... -- J. de Loctra

Re: Problem with index using regular expression

From
Alejandro Dubrovsky
Date:
On 06/06/11 17:07, Federico Di Gregorio wrote:
> On 06/06/11 09:03, Håvard Wahl Kongsgård wrote:
>> substring(nodes.name <http://nodes.name> from
>> E'\\w.*\\s(\\w.*)$|\\w+\\s(\\w.*)$') = substring(income.name
>> <http://income.name> from E'\\w.*\\s(\\w.*)$|\\w+\\s(\\w.*)$') limit 1000")
>
> Try doubling every '\' because that's a Python escape character too and
> you want to send to the database the "\\X" sequence, not "\X"; i.e.,
>
> ... from E'\\\\w.*\\\\s(\\\\w.*)$|\\\\w+\\\\s(\\\\w.*)$') ...
>
It's easier to just put an r at the front of the string.
ie
r"SELECT Distinct On (nodes.id) nodes.ID from nodes inner join income on
substring(nodes.name from E'\\w.*\\s(\\w.*)$|\\w+\\s(\\w.*)$') =
substring(income.name from E'\\w.*\\s(\\w.*)$|\\w+\\s(\\w.*)$') limit 1000"




Re: Problem with index using regular expression

From
Håvard Wahl Kongsgård
Date:
Thanks, that solved it

On Mon, Jun 6, 2011 at 9:10 AM, Alejandro Dubrovsky <alex.dubrovsky@hitwise.com> wrote:
On 06/06/11 17:07, Federico Di Gregorio wrote:
> On 06/06/11 09:03, Håvard Wahl Kongsgård wrote:
>> substring(nodes.name <http://nodes.name> from
>> E'\\w.*\\s(\\w.*)$|\\w+\\s(\\w.*)$') = substring(income.name
>> <http://income.name> from E'\\w.*\\s(\\w.*)$|\\w+\\s(\\w.*)$') limit 1000")
>
> Try doubling every '\' because that's a Python escape character too and
> you want to send to the database the "\\X" sequence, not "\X"; i.e.,
>
> ... from E'\\\\w.*\\\\s(\\\\w.*)$|\\\\w+\\\\s(\\\\w.*)$') ...
>
It's easier to just put an r at the front of the string.
ie
r"SELECT Distinct On (nodes.id) nodes.ID from nodes inner join income on
substring(nodes.name from E'\\w.*\\s(\\w.*)$|\\w+\\s(\\w.*)$') =
substring(income.name from E'\\w.*\\s(\\w.*)$|\\w+\\s(\\w.*)$') limit 1000"




--
Sent via psycopg mailing list (psycopg@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/psycopg