Re: Optimizing query? - Mailing list pgsql-general
From | Pavel Stehule |
---|---|
Subject | Re: Optimizing query? |
Date | |
Msg-id | CAFj8pRCNHg9-4KKyhkF_X3Z3=svSvzd88RW1MB2a4405QS4vWA@mail.gmail.com Whole thread Raw |
In response to | Re: Optimizing query? (hamann.w@t-online.de) |
Responses |
Re: Optimizing query?
|
List | pgsql-general |
Hello 2013/1/31 <hamann.w@t-online.de>: > > Pavel Stehlule wrote: > >>> >> Hi, >>> >> >>> >> I am trying to match items from 2 tables based on a common string. >>> >> One is a big table which has one column with entries like XY123, ABC44, = >>> etc >>> >> The table has an index on that column. >>> >> The second table is, typically, much smaller >>> >> >>> >> select .... from tab1, tab2 where tab1.code =3D tab2.code; >>> >> >>> >> This works fine and fast. >>> >> Now, as a variant, I have some entries like XY423A, XY423B, GF55A, GF55D= >>> in the >>> >> big table and want them to match XY423, GF55 in the second table >>> >> >>> >> Variants I have tried >>> >> >>> >> select .... from tab1, tab2 where tab1.code ~ (tab2.code||'($|[A-Z])'); >>> >> select .... from tab1, tab2 where tab1.code ~ ('^'||tab2.code||'($|[A-Z= >>> ])'); >>> >> >>> > >>> > Have you tried the substring function? >>> > >>> > select .... from tab1, tab2 where substring(tab1.code from 1 for 5) =3D >>> > tab2.code >>> > > > Hi Pavel, it was just by chance that a fixed size substring would match the > data at hand. It is more common to have a digit/letter (or vice versa) boundary > or a hyphen there > >>> > >>> >> both take an enormous time. In the better case that I can subset (e.g. a= >>> ll candidates in table 2 >>> >> share initial "AX") I get back to manageable times by adding >>> >> and tab1.code ~ '^AX' >>> >> into the recipe. Actual runtime with about a million entries in tab1 and= >>> 800 entries in tab2 >>> >> is about 40 seconds. >>> >>> any join where result is related to some function result can be very >>> slow, because estimation will be out and any repeated function >>> evaluation is just expensive. >>> > I see the problem since obviously every the ~ operator with a non-constant > pattern is constantly recompiling the pattern. > > I wonder whether it would be possible to invent a prefix-match operator that approaches > the performance of string equality. I noted in the past (not sure whether anything > has changed in regex matching) that a constant leading part of regex would improve > performance, i.e. use an index scan to select possible candidates. > >>> You can try use a functional index. >>> >>> create index on tab2 ((substring(tab1.code from 1 for 5)) >>> > > What kind of trick is that - mixing two tables into a functional index? it is not possible - you can do some auxiliary table and creating indexes over this table but maybe https://github.com/dimitri/prefix can help Regards Pavel > What would the exact syntax be for that? > > Regards > Wolfgang Hamann > > > > > > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general
pgsql-general by date: