Thread: Question about POSIX Regular Expressions performance on large dataset.
From: Jose Ildefonso Camargo Tolosa
Hi! I'm analyzing the possibility of using PostgreSQL to store a huge amount of data (around 1000M records, or so....). Even though the records are short (each one has just a timestamp and a string of less than 128 characters), the strings will be matched against POSIX Regular Expressions (different regexps, and maybe complex ones).

Because I don't have a system large enough to test this here, I have to ask you (I may be able to borrow a medium-size server, but that would take a week or more, so I decided to ask here first): how is the performance of regexp matching in PostgreSQL? Can it use indexes? My guess is no, because I don't see a general way of indexing for regexp matching :( , so: tablescans for this huge dataset..... What do you think of this?

Sincerely,

Ildefonso Camargo
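For reference, the workload being described, a POSIX regexp match with PostgreSQL's `~` operator against a narrow table, might look like this (table and column names are made up for illustration):

```sql
-- Hypothetical schema matching the description: a timestamp
-- plus a short string.
CREATE TABLE records (
    ts  timestamptz  NOT NULL,
    msg varchar(128) NOT NULL
);

-- POSIX regexp match; with no usable index this is a full
-- sequential scan over all ~1000M rows.
SELECT ts, msg
FROM records
WHERE msg ~ '^(foo|bar)[0-9]+';
```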
On Tue, Aug 17, 2010 at 8:21 PM, Jose Ildefonso Camargo Tolosa <ildefonso.camargo@gmail.com> wrote:
> How is the performance
> of Regexp matching in PostgreSQL? Can it use indexes? My guess is:
> no, because I don't see a way of generally indexing to match regexp :(
> , so, tablescans for this huge dataset.....

Yes, it can index such things, but it has to index them in a fixed way, i.e. you can create functional indexes with pre-built regexes. But for ones where the values change each time, you're correct: no indexes will be used. Could full text searching be used instead?
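One way to read "pre-built regexes" is a partial index whose predicate bakes in a fixed pattern; a minimal sketch, assuming a hypothetical `records` table and pattern:

```sql
-- Index only the rows matching one fixed, known-in-advance regexp.
CREATE INDEX records_err_idx ON records (ts)
WHERE msg ~ '^ERROR: .*timeout';

-- A query repeating the exact same predicate can use the partial
-- index instead of scanning the whole table.
SELECT ts, msg
FROM records
WHERE msg ~ '^ERROR: .*timeout'
  AND ts > now() - interval '1 day';
```

This only helps when the pattern is known at index-creation time; an ad-hoc regexp supplied at query time still forces a sequential scan.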
Re: Question about POSIX Regular Expressions performance on large dataset.
From: Jose Ildefonso Camargo Tolosa
Hi, again,

I just had this wacky idea and wanted to share it: what do you think of having the dataset divided among several servers, sending the query to all of them, and then just having the application "unify" the results from all the servers?

Would that work for this kind of *one table* search? (There are no joins, and never will be.) I think it should, but: what do you think?

Ildefonso.

On Tue, Aug 17, 2010 at 9:51 PM, Jose Ildefonso Camargo Tolosa <ildefonso.camargo@gmail.com> wrote:
> I'm analyzing the possibility of using PostgreSQL to store a huge
> amount of data (around 1000M records, or so....), and these, even
> though are short (each record just have a timestamp, and a string that
> is less than 128 characters in length), the strings will be matched
> against POSIX Regular Expressions (different regexps, and maybe
> complex).
You can do something similar on the same machine if you can come up with a common way to partition your data. Then you split your 1B rows up into chunks of 10M or so, put each in its own table, and hit the right table. You can use partitioning / table inheritance if you want to, or just know the table name ahead of time. We did something similar with mnoGoSearch: we break it up into a few hundred different schemas and hit the one for a particular site, to keep the individual mnoGoSearch tables small and fast.

On Tue, Aug 17, 2010 at 8:30 PM, Jose Ildefonso Camargo Tolosa <ildefonso.camargo@gmail.com> wrote:
> what do you think of having the dataset divided among several servers,
> and sending the query to all of them, and then just have the
> application "unify" the results from all the servers?
>
> --
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql

--
To understand recursion, one must first understand recursion.
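Table inheritance with CHECK constraints is the classic way to express the partitioning described above; a minimal sketch, with hypothetical table names and date ranges:

```sql
-- Parent table; each child holds one time range.
CREATE TABLE records (
    ts  timestamptz  NOT NULL,
    msg varchar(128) NOT NULL
);

CREATE TABLE records_2010_07 (
    CHECK (ts >= '2010-07-01' AND ts < '2010-08-01')
) INHERITS (records);

CREATE TABLE records_2010_08 (
    CHECK (ts >= '2010-08-01' AND ts < '2010-09-01')
) INHERITS (records);

-- With constraint_exclusion enabled, a query constrained on ts
-- only scans the children whose CHECK range can match.
SET constraint_exclusion = on;
SELECT ts, msg
FROM records
WHERE ts >= '2010-08-01' AND ts < '2010-08-15'
  AND msg ~ 'some regexp';
```

The regexp match still scans every row of the selected partitions, but pruning keeps each scan down to ~10M rows instead of 1B.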
Re: Question about POSIX Regular Expressions performance on large dataset.
From: Sergey Konoplev
On 18 August 2010 06:30, Jose Ildefonso Camargo Tolosa <ildefonso.camargo@gmail.com> wrote:
> what do you think of having the dataset divided among several servers,
> and sending the query to all of them, and then just have the
> application "unify" the results from all the servers?
>
> Would that work for this kind of *one table* search? (there are no
> joins, and will never be). I think it should, but: what do you think?

There is a tool for this: http://plproxy.projects.postgresql.org/doc/tutorial.html

--
Sergey Konoplev
Blog: http://gray-hemp.blogspot.com / Linkedin: http://ru.linkedin.com/in/grayhemp / JID/GTalk: gray.ru@gmail.com / Skype: gray-hemp / ICQ: 29353802
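With PL/Proxy, the "send to all servers and unify" step is expressed as a proxy function that runs on every partition; a rough sketch, assuming a cluster named `recordcluster` has been configured and that each partition database defines a matching `search_records` function:

```sql
-- Proxy function on the frontend database. PL/Proxy forwards the
-- call to every partition in the cluster (RUN ON ALL) and
-- concatenates the result sets, so the application-side "unify"
-- step comes for free.
CREATE FUNCTION search_records(
    pattern text,
    OUT ts  timestamptz,
    OUT msg text
) RETURNS SETOF record AS $$
    CLUSTER 'recordcluster';
    RUN ON ALL;
$$ LANGUAGE plproxy;

-- Called like an ordinary set-returning function:
-- SELECT * FROM search_records('^(foo|bar)[0-9]+');
```

Each partition still does its own sequential regexp scan, but the scans run in parallel across the servers.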