Thread: looking for a faster way to do that
Hi,

I have one large table (about a million entries) with an indexed column containing codes
like ABC3561A, ABC3563X, 72-451-823 etc. (lots of order numbers from different
manufacturers).

When I ask for a specific item
select code .... where code = 'ABC3563X'
I get a fast result. I also get a fast result when doing a prefix match
select code .... where code ~ '^ABC3563'

If I am retrieving many items by joining with another table
select code ..... where code = wantcode
this is still fast.
If I try to get many items on a prefix match
select code .... where code ~ wantcode
things go very slow. Explain shows a nested loop, so seemingly the table is rescanned
for every wanted item in the other table. A test run (3000 wanted codes against a
shortened table of 10000 ones) took about 200 seconds to complete.

What other queries could I use to get the requested selection?

Regards
Wolfgang
hamann.w <mailto:hamann.w@t-online.de> wrote (Sent: Wed 2011-09-21 17:59):

> I have one large table (about a million entries) with an indexed column containing codes
> like ABC3561A, ABC3563X, 72-451-823 etc. (lots of order numbers from different
> manufacturers)
>
> When I ask for a specific item
> select code .... where code = 'ABC3563X'
> I get fast result. I also get fast result when doing a prefix match
> select code .... where code ~ '^ABC3563'
>
> If I am retrieving many items by joining with another table
> select code ..... where code = wantcode
> this is still fast.
> If I try to get many items on a prefix match
> select code .... where code ~ wantcode
> things go very slow. Explain shows a nested loop, so seemingly the table is rescanned
> for every wanted item in the other table. A test run (3000 wanted codes against a
> shortened table of 10000 ones) took about 200 seconds to complete
>
> What other queries could I use to get the requested selection?

Is the index used for "where code ~ '^ABC3563'"?

If not, then the result is fast only because the table is scanned only once,
and it's just the factor of 3000 that's killing you.

The second query (where code ~ wantcode) can never use an index because
the pattern "wantcode" is unknown at query planning time.

Yours,
Laurenz Albe
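[Editor's note, not part of the original thread] The reason the constant pattern '^ABC3563' can use the index is that the planner rewrites an anchored regex into a range scan over the B-tree sort order. The sketch below shows that rewrite done by hand, using Python's stdlib sqlite3 module as a stand-in database; the table name, index name, and sample codes mirror the thread but are invented here:

```python
import sqlite3

# In-memory stand-in for the indexed items table from the thread.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (code TEXT)")
con.execute("CREATE INDEX itemsc ON items (code)")
con.executemany("INSERT INTO items VALUES (?)",
                [("ABC3561A",), ("ABC3563X",), ("ABC3563B",), ("72-451-823",)])

prefix = "ABC3563"
# 'code ~ ^ABC3563' is equivalent to this closed-open range on the sort
# order, which a B-tree index can serve; U+FFFF acts as a high sentinel.
rows = con.execute(
    "SELECT code FROM items WHERE code >= ? AND code < ? ORDER BY code",
    (prefix, prefix + "\uffff")).fetchall()
codes = [r[0] for r in rows]
print(codes)  # -> ['ABC3563B', 'ABC3563X']
```

In PostgreSQL the planner only performs this rewrite when the pattern is a constant at plan time (and, for LIKE, when the column uses the C collation or a text_pattern_ops index), which is exactly why the join variant below defeats it.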
> If I try to get many items on a prefix match
> select code .... where code ~ wantcode
> things go very slow. Explain shows a nested loop, so seemingly the table is rescanned
> for every wanted item in the other table. A test run (3000 wanted codes against a
> shortened table of 10000 ones) took about 200 seconds to complete
What is the output of explain?
You say 'the other table', so presumably we're dealing with a foreign key here. Is there an index on that column?
--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.
Alban Hertroys <haramrae@gmail.com> wrote:

>> What is the output of explain?
>>
>> You say 'the other table', so presumably we're dealing with a foreign key
>> here. Is there an index on that column?

Albe Laurenz wrote:

>> Is the index used for "where code ~ '^ABC3563'"?
>>
>> If not, then the result is fast only because the table is scanned only once,
>> and it's just the factor of 3000 that's killing you.
>>
>> The second query (where code ~ wantcode) can never use an index because
>> the pattern "wantcode" is unknown at query planning time.
>>
>> Yours,
>> Laurenz Albe

Here I created a subset (just number and code matching a certain prefix):

\d items
        Table "pg_temp_1.items"
 Column |         Type          | Modifiers
--------+-----------------------+-----------
 num    | integer               |
 code   | character varying(40) |

create index itemsc on items (code);

select count(*) from items;
 count
-------
  9614

A single anchored query
select * from items where code ~ '^ABC';
does indeed use the index to retrieve data.

Next I copied a file of wanted codes:
create temp table n (wantcode text);
\copy n from /tmp/rmartin.tmp
The file contains plain names, i.e. unanchored matches.

explain analyze select num, n.wantcode from items, n where items.code ~ n.wantcode;

 Nested Loop  (cost=20.00..216502.14 rows=48070 width=36) (actual time=148.479..336280.488 rows=2871 loops=1)
   Join Filter: (("outer".code)::text ~ "inner".wantcode)
   ->  Seq Scan on items  (cost=0.00..167.14 rows=9614 width=42) (actual time=0.048..38.666 rows=9614 loops=1)
   ->  Materialize  (cost=20.00..30.00 rows=1000 width=32) (actual time=0.001..1.049 rows=815 loops=9614)
         ->  Seq Scan on n  (cost=0.00..20.00 rows=1000 width=32) (actual time=0.003..1.839 rows=815 loops=1)
 Total runtime: 336286.692 ms

An exact match "where items.code = n.wantcode" on the same data completes in 40 ms.

BTW: indexing the second table does not affect the query plan or the runtime, it just
shows the actual row count rather than an estimate.
This is, of course, bad; an anchored match could be faster and is also more appropriate
to the scenario. So I change the contents of the second table
update n set wantcode = textcat('^', wantcode);
and try again, with similar results:

 Nested Loop  (cost=14.15..176478.01 rows=39178 width=36) (actual time=125.114..308831.697 rows=2871 loops=1)
   Join Filter: (("outer".code)::text ~ "inner".wantcode)
   ->  Seq Scan on items  (cost=0.00..167.14 rows=9614 width=42) (actual time=0.061..2034.572 rows=9614 loops=1)
   ->  Materialize  (cost=14.15..22.30 rows=815 width=32) (actual time=0.001..1.095 rows=815 loops=9614)
         ->  Seq Scan on n  (cost=0.00..14.15 rows=815 width=32) (actual time=0.114..1.893 rows=815 loops=1)
 Total runtime: 308837.746 ms

I am aware that this is unlikely to work fast: the planner would perhaps need a hint in
the query, rather than in the data column, to choose an anchored match algorithm (in
case there is such an algorithm, of course).

So I wonder whether there might be a different approach to this problem rather than
pattern matching. I recall I had a similar problem before with a "contacts" column
possibly containing one or more email addresses. There, searches would also have had
number-of-people times number-of-requests performance. I finally ended up with a @@
match (contrib/tsquery) and a supporting GiST index, but that only supports exact match,
not prefix match.

Regards
Wolfgang Hamann
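[Editor's note, not part of the original thread] Since every wantcode here is a plain prefix, the join itself can be rewritten as a range join that avoids per-row regex evaluation entirely: roughly `code >= wantcode AND code < wantcode || <high sentinel>`. In PostgreSQL this is only index-friendly under the C collation or with a text_pattern_ops index, so treat it as a sketch rather than a drop-in fix. An illustration with stdlib sqlite3, the schema mimicking the thread's tables and the data invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE items (num INTEGER, code TEXT);
    CREATE INDEX itemsc ON items (code);
    CREATE TABLE n (wantcode TEXT);
""")
con.executemany("INSERT INTO items VALUES (?, ?)",
                [(1, "ABC3561A"), (2, "ABC3563X"), (3, "ABC3563B"), (4, "XYZ9")])
con.executemany("INSERT INTO n VALUES (?)", [("ABC3563",), ("XYZ",)])

# Prefix join expressed as a half-open range per wanted code; char(65535)
# produces U+FFFF, a sentinel sorting after any character used in the codes.
matches = con.execute("""
    SELECT num, wantcode
    FROM items JOIN n
      ON items.code >= n.wantcode AND items.code < n.wantcode || char(65535)
    ORDER BY num
""").fetchall()
print(matches)
```

A merge or index-nested-loop plan over a range condition like this touches only the matching slice of the index per wanted code, instead of re-testing every row with the regex engine.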
At 09:45 23/09/2011, hamann.w@t-online.de wrote:

>A single anchored query
>select * from items where code ~ '^ABC';
>does indeed use the index to retrieve data.
>
>So I wonder whether there might be a different approach to this problem rather than
>pattern matching. I recall I had a similar problem before with a "contacts" column
>possibly containing one or more email addresses. Here searches would also be number of
>people times number of requests performance. I finally ended up with a @@ match
>(contrib/tsquery) and a supporting GIST index, but that only supports exact match, not
>prefix

You can try these; I doubt they will use any index, but it's a different approach:

select * from items where length(items.code)<>length(rtrim(items.code,'ABC'));

select * from items where strpos(items.code,'ABC')=0 or strpos(items.code,'any_substring')=0;

HTH
Alban Hertroys <haramrae@gmail.com> wrote:
>> What is the output of explain?
>>
>> You say 'the other table', so presumably we're dealing with a foreign key
>> here. Is there an index on that column?

Albe Laurenz wrote:

>> Is the index used for "where code ~ '^ABC3563'"?
>>
>> If not, then the result is fast only because the table is scanned only once,
>> and it's just the factor of 3000 that's killing you.
>>
>> The second query (where code ~ wantcode) can never use an index because
>> the pattern "wantcode" is unknown at query planning time.
>>
>> Yours,
>> Laurenz Albe

Here I created a subset (just number and code matching a certain prefix):

\d items
        Table "pg_temp_1.items"
 Column |         Type          | Modifiers
--------+-----------------------+-----------
 num    | integer               |
 code   | character varying(40) |
create index itemsc on items (code);
select count(*) from items;
count
-------
9614
A single anchored query
select * from items where code ~ '^ABC';
does indeed use the index to retrieve data.
Next I copied a file of wanted codes
create temp table n (wantcode text);
\copy n from /tmp/rmartin.tmp
the file contains plain names, i.e. unanchored matches
explain analyze select num, n.wantcode from items, n where items.code ~ n.wantcode;
 Nested Loop  (cost=20.00..216502.14 rows=48070 width=36) (actual time=148.479..336280.488 rows=2871 loops=1)
   Join Filter: (("outer".code)::text ~ "inner".wantcode)
   ->  Seq Scan on items  (cost=0.00..167.14 rows=9614 width=42) (actual time=0.048..38.666 rows=9614 loops=1)
   ->  Materialize  (cost=20.00..30.00 rows=1000 width=32) (actual time=0.001..1.049 rows=815 loops=9614)
         ->  Seq Scan on n  (cost=0.00..20.00 rows=1000 width=32) (actual time=0.003..1.839 rows=815 loops=1)
 Total runtime: 336286.692 ms
So you're comparing a variable field value to a variable pattern - yeah, that's going to hurt. There's no way you could index exactly that.
Perhaps there's some way you can transform the problem so that you get something indexable?
For example, if your match patterns follow a certain pattern by themselves, you could add a column with the longest match pattern that would match the string. Then you could just do a query for which records have the match pattern (in that new column) that you're looking for and voila!
If something like that is possible strongly depends on what kind of match patterns you're using, of course.
An exact match "where items.code = n.wantcode" on the same data completes in 40 ms
That's an exact string match, of course that will be fast ;)
--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.
Eduardo Morras wrote:

>> You can try these, i doubt they will use any index but its a
>> different approach:
>>
>> select * from items where length(items.code)<>length(rtrim(items.code,'ABC'));
>>
>> select * from items where strpos(items.code,'ABC')=0 or
>> strpos(items.code,'any_substring')=0;

Hi,

if I understand this right, it does not mean "check if the string appears at position 0",
which could translate into an index query, but rather "check if the string appears
anywhere and then check if that is position 0", so the entire table is checked.

explain analyze select items.num, wantcode from items, n where strpos(code, wantcode) = 0;

 Nested Loop  (cost=167.14..196066.54 rows=39178 width=36) (actual time=0.074..36639.312 rows=7832539 loops=1)
   Join Filter: (strpos(("inner".code)::text, "outer".wantcode) = 0)
   ->  Seq Scan on n  (cost=0.00..14.15 rows=815 width=32) (actual time=0.005..2.212 rows=815 loops=1)
   ->  Materialize  (cost=167.14..263.28 rows=9614 width=42) (actual time=0.007..13.970 rows=9614 loops=815)
         ->  Seq Scan on items  (cost=0.00..167.14 rows=9614 width=42) (actual time=0.044..14.855 rows=9614 loops=1)
 Total runtime: 46229.836 ms

The query ran much faster than the pattern query, however. This seems to be the
performance of just searching for a plain string vs. initializing the regex engine every
time (for 815 queries in a test set).

Regards
Wolfgang Hamann
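[Editor's note, not part of the original thread] A side note on the strpos() semantics here: in PostgreSQL, strpos(string, substring) is 1-based and returns 0 when the substring is absent. So "strpos(code, wantcode) = 0" actually selects the rows that do NOT contain the wanted code at all, which is consistent with the 7832539 rows in the plan above (roughly 815 x 9614). The prefix test would be "strpos(code, wantcode) = 1". A quick illustration in Python:

```python
def strpos(string, sub):
    """Emulate PostgreSQL's strpos(): 1-based position, 0 when absent."""
    return string.find(sub) + 1

codes = ["ABC3561A", "ABC3563X", "72-451-823"]

# "= 0" keeps the rows where the wanted code does NOT occur at all...
not_containing = [c for c in codes if strpos(c, "ABC3563") == 0]
# ...while "= 1" is the actual prefix test.
prefixed = [c for c in codes if strpos(c, "ABC3563") == 1]
```

So the 46-second run above was scanning for non-matches, and its speed advantage over the regex join says nothing yet about the prefix problem itself.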
Alban Hertroys wrote:

>> So you're comparing a variable field value to a variable pattern - yeah,
>> that's going to hurt. There's no way you could index exactly that.
>>
>> Perhaps there's some way you can transform the problem so that you get
>> something indexable?
>> For example, if your match patterns follow a certain pattern by themselves,
>> you could add a column with the longest match pattern that would match the
>> string. Then you could just do a query for which records have the match
>> pattern (in that new column) that you're looking for and voila!
>>
>> If something like that is possible strongly depends on what kind of match
>> patterns you're using, of course.

Hi Alban,

I already did that - the test set is just all records from the real table (about a
million entries) that match the common 'ABC' prefix.

>>> An exact match "where items.code = n.wantcode" on the same data completes
>>> in 40 ms

>> That's an exact string match, of course that will be fast ;)

The main difference is: the fast query looks like

explain select items.num, wantcode from items, n where code = wantcode;
 Merge Join  (cost=53.56..1104.02 rows=39178 width=36)
   Merge Cond: (("outer".code)::text = "inner".wantcode)
   ->  Index Scan using itemsc on items  (cost=0.00..438.75 rows=9614 width=42)
   ->  Sort  (cost=53.56..55.60 rows=815 width=32)
         Sort Key: n.wantcode
         ->  Seq Scan on n  (cost=0.00..14.15 rows=815 width=32)

and the slow one looks like this:

 Nested Loop  (cost=14.15..176478.01 rows=39178 width=36)
   Join Filter: (("outer".code)::text ~ "inner".wantcode)

So the database takes an entirely different approach at retrieving the entries.

Regards
Wolfgang Hamann
On 23 September 2011 14:29, <hamann.w@t-online.de> wrote:

> Alban Hertroys wrote:
>
>>> So you're comparing a variable field value to a variable pattern - yeah,
>>> that's going to hurt. There's no way you could index exactly that.
>>>
>>> Perhaps there's some way you can transform the problem so that you get
>>> something indexable?
>>> For example, if your match patterns follow a certain pattern by themselves,
>>> you could add a column with the longest match pattern that would match the
>>> string. Then you could just do a query for which records have the match
>>> pattern (in that new column) that you're looking for and voila!
>>>
>>> If something like that is possible strongly depends on what kind of match
>>> patterns you're using, of course.
>
> Hi Alban,
>
> I already did that - the test set is just all records from the real table (about a million
> entries) that match the common 'ABC' prefix

I think you misunderstood what I wrote. Notice the difference between "which strings
match the pattern" and "which records have the match pattern (in that new column)" - the
first is a regular expression match (unindexable), while the second is a string equality
match (indexable).

What I'm suggesting is to add a column, which for the string 'ABCDEFG' would contain
'ABC%'. Data would look like:

SELECT str, pattern FROM tbl;
   str   | pattern
---------+---------
 ABCDEFG | ABC%
 ABCDEF  | ABC%
 BCDEFGH | BCD%
 etc.

When you look for records that match the pattern 'ABC%', you would normally perform a
query like:

SELECT str FROM tbl WHERE str LIKE 'ABC%';

But with this new column, you would query:

SELECT str FROM tbl WHERE pattern = 'ABC%';

As I said, it depends a lot on your pattern needs whether this solution would work at
all for you. If you only ever use a few patterns, it will work. If you use many
different patterns or don't know before-hand which patterns will be used, it won't work
well at all.
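[Editor's note, not part of the original thread] Alban's precomputed-pattern idea boils down to trading a per-row regex for an equality lookup on a derived column. A minimal sketch in Python; the fixed three-character-prefix rule is an assumption for illustration, since a real rule would depend on the manufacturers' numbering schemes:

```python
from collections import defaultdict

def pattern_of(s):
    # Assumed derivation rule: the pattern is the first three characters
    # plus a LIKE wildcard. Real schemes would need a smarter rule.
    return s[:3] + "%"

rows = ["ABCDEFG", "ABCDEF", "BCDEFGH"]

# Precompute once; this dict plays the role of an index on the new column.
by_pattern = defaultdict(list)
for s in rows:
    by_pattern[pattern_of(s)].append(s)

# Answering "which records have pattern 'ABC%'" is now an equality lookup,
# with no per-row pattern matching at query time.
print(by_pattern["ABC%"])  # -> ['ABCDEFG', 'ABCDEF']
```

The cost of deriving the pattern is paid once per row at write time instead of once per row per query, which is exactly the trade Alban describes.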
> The main difference is: the fast query looks like
>
> explain select items.num, wantcode from items, n where code = wantcode;
>  Merge Join  (cost=53.56..1104.02 rows=39178 width=36)
>    Merge Cond: (("outer".code)::text = "inner".wantcode)
>    ->  Index Scan using itemsc on items  (cost=0.00..438.75 rows=9614 width=42)
>    ->  Sort  (cost=53.56..55.60 rows=815 width=32)
>          Sort Key: n.wantcode
>          ->  Seq Scan on n  (cost=0.00..14.15 rows=815 width=32)

Is there an index on wantcode? If you have a million or more records, I would expect an
index scan for a measly 815 matches...

> and the slow ones looks like that one:
>
>  Nested Loop  (cost=14.15..176478.01 rows=39178 width=36)
>    Join Filter: (("outer".code)::text ~ "inner".wantcode)
>
> So the database takes an entirely different approach at retrieving the entries.

Yes, because you're still using ~ there, with a pattern that's unknown at query planning
time. That will only be fast under some fairly rare circumstances.

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.
>> I think you misunderstood what I wrote. Notice the difference between
>> "which strings match the pattern" and "which records have the match
>> pattern (in that new column)" - the first is a regular expression
>> match (unindexable), while the second is a string equality match
>> (indexable).
>>
>> What I'm suggesting is to add a column, which for the string 'ABCDEFG'
>> would contain 'ABC%'. Data would look like:
>>
>> SELECT str, pattern FROM tbl;
>>    str   | pattern
>> ---------+---------
>>  ABCDEFG | ABC%
>>  ABCDEF  | ABC%
>>  BCDEFGH | BCD%
>> etc.
>>
>> When you look for records that match the pattern 'ABC%', you would
>> normally perform a query like:
>>
>> SELECT str FROM tbl WHERE str LIKE 'ABC%';
>>
>> But with this new column, you would query:
>>
>> SELECT str FROM tbl WHERE pattern = 'ABC%';
>>
>> As I said, it depends a lot on your pattern needs whether this
>> solution would work at all for you. If you only ever use a few
>> patterns, it will work. If you use many different patterns or don't
>> know before-hand which patterns will be used, it won't work well at all.
>>
>> Is there an index on wantcode? If you have a million or more records,
>> I would expect an index scan for a measly 815 matches...

Hi,

the problem is that I read the patterns from a file, as part of the script. They are not
known separately. So it seems that creating the extra column is just the same effort as
running the original query. The processing is one-time in nature. The one thing I can do
is selecting a range of items on a common prefix, if all the codes in the second table
have some characters in common.

When I ran a test, there was no real difference with wantcode indexed or not.

It was interesting to try another suggestion and note the difference between comparison
functions, with identical (lack of) use of an index.

Regards
Wolfgang Hamann
> explain analyze select num, n.wantcode from items, n where items.code ~ n.wantcode;
>  Nested Loop  (cost=20.00..216502.14 rows=48070 width=36) (actual time=148.479..336280.488 rows=2871 loops=1)
>    Join Filter: (("outer".code)::text ~ "inner".wantcode)
>    ->  Seq Scan on items  (cost=0.00..167.14 rows=9614 width=42) (actual time=0.048..38.666 rows=9614 loops=1)
>    ->  Materialize  (cost=20.00..30.00 rows=1000 width=32) (actual time=0.001..1.049 rows=815 loops=9614)
>          ->  Seq Scan on n  (cost=0.00..20.00 rows=1000 width=32) (actual time=0.003..1.839 rows=815 loops=1)
>  Total runtime: 336286.692 ms

>> So you're comparing a variable field value to a variable pattern - yeah, that's going
>> to hurt. There's no way you could index exactly that.
>>
>> Perhaps there's some way you can transform the problem so that you get something
>> indexable?
>> For example, if your match patterns follow a certain pattern by themselves, you could
>> add a column with the longest match pattern that would match the string. Then you
>> could just do a query for which records have the match pattern (in that new column)
>> that you're looking for and voila!
I've only been partially following this thread, but did you try something like:

WHERE items.code ~ ('^' || n.wantcode)

Not sure if this will be valid for your needs, but the issue is that PostgreSQL cannot
rely on an index for non-anchored search patterns, and your compare-to data rightly does
not contain regex meta-characters. If you explicitly indicate that the input expression
is going to be anchored, would PostgreSQL then realize it can use the index?

Not tested.
David J.
At 14:12 23/09/2011, hamann.w@t-online.de wrote:

>Eduardo Morras wrote:
>
>>> You can try these, i doubt they will use any index but its a
>>> different approach:
>>>
>>> select * from items where length(items.code)<>length(rtrim(items.code,'ABC'));
>>>
>>> select * from items where strpos(items.code,'ABC')=0 or
>>> strpos(items.code,'any_substring')=0;
>
>Hi,
>
>if I understand this right, it does not mean "check if the string appears at position 0"
>which could translate into an index query, but rather "check if the string appears
>anywhere and then check if that is position 0", so the entire table is checked.

The second one yes; as it checks all the patterns you want only one time per row, it
only needs one table scan. The first one eliminates the substring 'ABC' from the string:
if the lengths of both strings are equal, the substring 'ABC' wasn't in it; if they are
different, the trimmed string will be shorter.

>explain analyze select items.num, wantcode from items, n where
>strpos(code, wantcode) = 0;
> Nested Loop  (cost=167.14..196066.54 rows=39178 width=36) (actual time=0.074..36639.312 rows=7832539 loops=1)
>   Join Filter: (strpos(("inner".code)::text, "outer".wantcode) = 0)
>   ->  Seq Scan on n  (cost=0.00..14.15 rows=815 width=32) (actual time=0.005..2.212 rows=815 loops=1)
>   ->  Materialize  (cost=167.14..263.28 rows=9614 width=42) (actual time=0.007..13.970 rows=9614 loops=815)
>         ->  Seq Scan on items  (cost=0.00..167.14 rows=9614 width=42) (actual time=0.044..14.855 rows=9614 loops=1)
> Total runtime: 46229.836 ms
>
>The query ran much faster than the pattern query, however. This seems to be the
>performance of just searching for a plain string vs. initializing the regex engine
>every time (for 815 queries in a test set)

It will do only one table scan, while your original code will do one for each substring
you want to test. You can add more and more substrings without too much cost.
If you want to use the regex engine instead of the PostgreSQL string functions, check
regexp_matches(); it should be faster if you have 3000 substrings.

select * from items where regexp_matches(items.code,'(ABC) (DE1) (any_substring)')<>'{}';

>Regards
>Wolfgang Hamann

HTH
Eduardo Morras wrote:

>> The second one yes, as it checks all patterns you want only one time
>> per row they only needs one table scan. The first one eliminates the
>> substring 'ABC' from the string, if the lengths of both strings are
>> equal, the substring 'ABC' wasn't in it. If they are different, the
>> trimmed string will be shorter.
>>
>> It will do only one table scan while your original code will do one
>> for each substring you want to test. You can add more and more
>> substrings without too much cost. If you want to use the regex engine
>> instead the postgresql string functions check the regexp_matches(), it
>> should be faster if you have 3000 substrings.
>>
>> select * from items where regexp_matches(items.code,'(ABC) (DE1)
>> (any_substring)')<>{};

Hi Eduardo,

it is clear that scanning the table once with a list of matches will outperform
rescanning the table for every string wanted. Now, my problem is that the patterns are
dynamic as well. So if I could translate a table with one column and a few thousand rows
into something like
regexp_matches(code,'string1|string2|.....string2781')
that would ideally be a performant query. Unfortunately I have no idea how I could
achieve this transformation inside the database. Doing it externally fails, because any
single query cannot be more than so few characters.

Regards
Wolfgang Hamann
On 25 Sep 2011, at 8:04, hamann.w@t-online.de wrote:

> Hi Eduardo,
>
> it is clear that scanning the table once with a list of matches will outperform
> rescanning the table for every string wanted. Now, my problem is that the patterns are
> dynamic as well. So if I could translate a table with one column and a few thousand rows
> into something like
> regexp_matches(code,'string1|string2|.....string2781')
> would ideally be a performant query. Unfortunately I have no idea how I could achieve this
> transformation inside the database. Doing it externally fails, because any single query cannot
> be more than so few characters.

To me it sounds a little bit like you're comparing every item in a warehouse to a set of
descriptions to see what type of item it is, which is something you would be much better
off storing as a property of the item. If an item is a fruit, store that it's a fruit!
But I'm guessing at what you're trying to accomplish, so here's a few other options...

I guess you could create 2781 expression indexes to do what you want (is there a limit
that prevents this?). Query planning would probably become kind of slow and the indices
will take up a considerable fraction of the total table storage required - that's a
pretty outlandish approach.

CREATE INDEX tbl_str_regex1 ON tbl (regexp_matches(str, 'string1'));
CREATE INDEX tbl_str_regex2 ON tbl (regexp_matches(str, 'string2'));
...
CREATE INDEX tbl_str_regex2781 ON tbl (regexp_matches(str, 'string2781'));

Or are you really going to query every record against all 2781 regexes? I can't figure
out a realistic scenario why you (or anyone) would want that. In that case those indices
aren't going to help you much, as the planner would have to hold every record in tbl to
each index - it won't do that.

You could also create a giant lookup table (a materialized view, if you like) where the
result of every match of str in tbl against the wantcode in the regex table is stored.
That's some huge overhead, but it will probably outperform most other options. With the
numbers you gave, that table will hold about 2-3 billion records with two foreign key
values and a truth value each.

Alban Hertroys

--
The scale of a problem often equals the size of an ego.
At 08:04 25/09/2011, hamann.w@t-online.de wrote:

>>> select * from items where regexp_matches(items.code,'(ABC) (DE1)
>>> (any_substring)')<>{};
>
>Hi Eduardo,
>
>it is clear that scanning the table once with a list of matches will outperform
>rescanning the table for every string wanted. Now, my problem is that the patterns are
>dynamic as well. So if I could translate a table with one column and a few thousand rows
>into something like
>regexp_matches(code,'string1|string2|.....string2781')
>would ideally be a performant query. Unfortunately I have no idea how I could achieve
>this transformation inside the database. Doing it externally fails, because any single
>query cannot be more than so few characters.

You can create a plpgsql function and pass it a setof text that does it. Sorry, but
instead of telling you What Must You Type, WMYT(c), I prefer the How Should You Do way,
HSYD(c). Note that you can get the same results using other approaches (f.ex. using FTS
described in chapter 12).

Check these topics:

Function creation
http://www.postgresql.org/docs/9.0/static/sql-createfunction.html

Tutorial about function creation
http://www.adderpit.com/practical-postgresql/x10374.htm

HTH
You forgot to include the list ;)

On 26 Sep 2011, at 6:06, hamann.w@t-online.de wrote:

> Alban Hertroys wrote:
>> To me it sounds a little bit like you're comparing every item in a
>> warehouse to a set of descriptions to see what type of item it is, which
>> is something you would be much better off storing as a property of the
>> item. If an item is a fruit, store that it's a fruit!
>> But I'm guessing at what you're trying to accomplish, so here's a few
>> other options...
>>
>> I guess you could create 2781 expression indexes to do what you want (is
>> there a limit that prevents this?). Query planning would probably become
>> kind of slow and the indices will take up a considerable fraction of the
>> total table storage required - that's a pretty outlandish approach.
>>
>> CREATE INDEX tbl_str_regex1 ON tbl (regexp_matches(str, 'string1'));
>> CREATE INDEX tbl_str_regex2 ON tbl (regexp_matches(str, 'string2'));
>> ...
>> CREATE INDEX tbl_str_regex2781 ON tbl (regexp_matches(str, 'string2781'));
>
> Hi,
>
> the strings are not really known before.
> Let me explain the scenario: there is one table about products, and code is the
> manufacturer's (or reseller's) product id.
> So, if ABC were a maker of laptops, ABC123 and ABC456 might be two different
> machines, and ABC123G might have G3 mobile installed, or ABC123X might be the
> same thing with extra memory. Obviously these device variants all look the same.
> Now a reseller sends us a collection of product images, so there would be
> ABC123.jpg and ABC456.jpg.
> The database task at hand is matching products to images (and then inserting
> the image name into a column of the products table).

I guessed right then. The origin of your problem is that you have similar items in
your database, but the database doesn't know they are similar. I'd fix that first;
it makes the problem a whole lot easier to handle.
For example, if an image comes in named ABC123G.jpg, you look up the product and
manufacturer and update its image. Then you query for products of the same
manufacturer that are similar to ABC123G (result: ABC123 and ABC123X) and update
their images as well (if appropriate; perhaps they have a recent enough image of
their own?).

As another whacky alternative to your regular expressions: I think it would be
possible to abuse the text-search functionality in Postgres to match product id's.
Those id's are basically a language per manufacturer describing product details.

If you can split the product id's up into lexemes that describe the base product id
and its options, then you can use full-text search to match up expressions similar
to the lexemes derived from the image name.

For example:

 productid | lexemes
-----------+-------------------
 ABC123    | {'ABC' '123'}
 ABC123G   | {'ABC' '123' 'G'}
 ABC123X   | {'ABC' '123' 'X'}
 ABC456    | {'ABC' '456'}

I'm not really sure if that's possible, or how much work it would be per
manufacturer - I haven't used FTS much.

I'd first see if I couldn't add that similarity information to the products table,
though ;)

Alban Hertroys

--
The scale of a problem often equals the size of an ego.
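A hypothetical sketch of that FTS idea, for the record. The splitting regexp, the
column name code_ts, and the single-manufacturer code format are all assumptions;
real manufacturers would each need their own splitting rule:

```sql
-- Derive lexemes from codes like ABC123G -> 'ABC 123 G' and index them.
ALTER TABLE products ADD COLUMN code_ts tsvector;

UPDATE products
   SET code_ts = to_tsvector('simple',
         regexp_replace(code, '^([A-Z]+)([0-9]+)([A-Z]*)$', '\1 \2 \3'));

CREATE INDEX products_code_ts ON products USING gin (code_ts);

-- Match the lexemes derived from an image name (ABC123.jpg) against products;
-- this would find ABC123, ABC123G and ABC123X but not ABC456.
SELECT code
  FROM products
 WHERE code_ts @@ to_tsquery('simple', 'ABC & 123');
```

The 'simple' configuration is used deliberately so no stemming or stop-word
processing mangles the code fragments.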
Eduardo Morras <nec556@retena.com> wrote:
>> You can create a plsql function and pass a setof text that do it.
>> Sorry but instead saying you What Must You Type, WMYT(c), i prefer
>> the How Should You Do way, HSYD(c). Note that you can get the same
>> results using other approachs (f.ex. using FTS described in chapter 12)
>>
>> Check this topics:
>>
>> Function Creation
>> http://www.postgresql.org/docs/9.0/static/sql-createfunction.html
>>
>> Tutorial about Function Creation
>> http://www.adderpit.com/practical-postgresql/x10374.htm

Hi,

I tried the pl/sql approach to convert the contents of that patterns table into a
regex. Results: 40 seconds runtime for 9500 candidates and 815 patterns;
718 seconds for the same set of 9500 candidates, but using 4000 patterns instead.
So it seems that I am reaching the limits of pattern matching.

As for the fulltext index (and the underlying tsquery): this is an exact match
rather than a prefix match, so I would need to know the match patterns in advance
in order to build the index.

I am thinking about that anyway (because ABC1234 likely should not match the ABC123
pattern in my context), but I would sort of prefer a system where I can state the
rules when I see the data set, rather than having to pre-create an index.

Thanks for the tutorial link :)
It seems that the responses to my post give all sorts of input that will help me
on other tasks.

Regards
Wolfgang Hamann
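For the pure prefix case there is one more rewrite worth noting: express "code starts
with wantcode" as a btree range condition, so the existing index can be probed once
per wanted row instead of matching a regex. This is only a sketch with the table
names assumed from the thread, and it presumes an index whose sort order is byte-wise
(the C locale, or an index created with text_pattern_ops and matching operators);
the appended chr(255) is a crude upper bound:

```sql
-- Prefix match as a range scan: for each wanted code, an index range probe
-- on items.code replaces a sequential regex scan.
SELECT i.code
FROM wanted w
JOIN items i
  ON i.code >= w.wantcode
 AND i.code <  w.wantcode || chr(255);
```

With 3000 wanted codes this turns the nested loop into 3000 cheap index range
scans rather than 3000 full passes over the table.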
Alban Hertroys <haramrae@gmail.com> wrote:
>> I guessed right then. The origin of your problem is that you have similar items
>> in your database, but the database doesn't know they are similar. I'd fix that
>> first, it makes the problem a whole lot easier to handle.
>>
>> For example, if an image comes in named ABC123G.jpg, you look up the product and
>> manufacturer and update its image. Then you query for products of the same
>> manufacturer that are similar to ABC123G (result: ABC123 and ABC123X) and update
>> their images as well (if appropriate; perhaps they have a recent enough image of
>> their own?).
>>
>> As another whacky alternative to your regular expressions; I think it would be
>> possible to abuse the text-search functionality in Postgres to match product
>> id's. Those id's are basically a language per manufacturer describing product
>> details.
>>
>> If you can split the product id's up into lexemes that describe the base product
>> id and its options, then you can use full-text search to match up expressions
>> similar to the lexemes derived from the image name.
>>
>> For example:
>>  productid | lexemes
>> -----------+-------------------
>>  ABC123    | {'ABC' '123'}
>>  ABC123G   | {'ABC' '123' 'G'}
>>  ABC123X   | {'ABC' '123' 'X'}
>>  ABC456    | {'ABC' '456'}
>>
>> I'm not really sure if that's possible, or how much work it would be per
>> manufacturer - I haven't used FTS much.
>>
>> I'd first see if I couldn't add that similarity information to the products
>> table, though ;)
>>
>> Alban Hertroys

Hi,

the actual process tends to be:
- add products to the database
- then receive images and try to match them to products.

So I know about the naming scheme only when I see a list of files, and would want
to have a cmdline option for my matching script that distinguishes formats like
a) exact
b) alpha suffix following numeric main body
c) period or slash between main part and related part

To make things even more complex, I might receive images from a reseller that
offers a few manufacturers using different conventions.
I wonder whether this would translate well into building a temporary index, if I
detect b or c patterns.

When I asked first, I also started working on a different approach. This did work
on the current case (I got similar datasets before, but with considerably fewer
items), so I will probably try that plus some ideas I got from this discussion,
and see how far I get.
That approach is a separate perl script that builds a tree structure of all the
image names and then tries to create a regex for a crude filter. In the example it
would have determined that all images in the lot match an ABC prefix. Then it
selects all matching codes (so it can work with the entire database) and runs them
through the same tree structure.

Regards
Wolfgang Hamann
At 18:18 26/09/2011, you wrote:
>I tried the pl/sql approach to convert the contents of that patterns
>table into a regex.
>Results: 40 seconds runtime for 9500 candidates and 815 patterns
>718 seconds for the same set of 9500 candidates, but using 4000
>patterns instead.
>So it seems that I am reaching limits of pattern match

Perhaps calling the function twice, with half the values each time, would go
faster.

How do you call the function, with EXECUTE or SELECT? If you use EXECUTE, the plan
prepared in a previous call is ignored and the statement is re-planned, which is
usually faster here. I don't know if it runs faster in your case, but you can
try it.

>As for the fulltext index (and the underlying tsquery): this is an
>exact match rather than prefix match, so I would need to know match
>patterns in advance in order to build the index
>
>I am thinking about that anyway (because ABC1234 likely should not
>match ABC123 pattern in my context), but I would sort of prefer a
>system where I can state the rules when I see the data set, rather
>than having to pre-create an index.
>
>Thanks for the tutorial link :)

It's the one I used some time ago. A bit old, but a very good one.

>It seems that the responses on my post give all sorts of input that
>will help me on other tasks
>
>Regards
>Wolfgang Hamann
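A minimal sketch of what the EXECUTE variant could look like; the function name,
return shape and the items table are assumptions, not code from the thread. Inside
pl/pgsql, RETURN QUERY EXECUTE ... USING (available since 8.4) plans the statement
with the concrete regex on every call instead of reusing a cached generic plan:

```sql
CREATE OR REPLACE FUNCTION match_codes(re text)
RETURNS SETOF text AS $$
BEGIN
  -- EXECUTE builds and plans the query anew each call, with the actual
  -- pattern value available to the planner.
  RETURN QUERY EXECUTE
    'SELECT code FROM items WHERE code ~ $1'
    USING re;
END;
$$ LANGUAGE plpgsql;

-- Called with the alternation built from the patterns table:
-- SELECT * FROM match_codes('^(ABC3563|ABC3561|...)');
```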