Thread: Question about POSIX Regular Expressions performance on large dataset.
From: Jose Ildefonso Camargo Tolosa
Hi! I'm analyzing the possibility of using PostgreSQL to store a huge amount of data (around 1000M records, or so....). Even though the records are short (each one has just a timestamp and a string of less than 128 characters), the strings will be matched against POSIX Regular Expressions (different regexps, and maybe complex ones).

Because I don't have a system large enough to test this here, I have to ask you (I may be able to borrow a medium-size server, but that would take a week or more, so I decided to ask here first): how is the performance of regexp matching in PostgreSQL? Can it use indexes? My guess is no, because I don't see a general way of indexing for regexp matching :( , so: tablescans for this huge dataset..... What do you think of this?

Sincerely,

Ildefonso Camargo
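For reference, the workload being described, a POSIX regexp match with PostgreSQL's `~` operator against a narrow table, might look like this (table and column names are made up for illustration):

```sql
-- Hypothetical schema matching the description: a timestamp
-- plus a short string.
CREATE TABLE records (
    ts  timestamptz  NOT NULL,
    msg varchar(128) NOT NULL
);

-- POSIX regexp match; with no usable index this is a full
-- sequential scan over all ~1000M rows.
SELECT ts, msg
FROM records
WHERE msg ~ '^(foo|bar)[0-9]+';
```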
On Tue, Aug 17, 2010 at 8:21 PM, Jose Ildefonso Camargo Tolosa <ildefonso.camargo@gmail.com> wrote:
> How is the performance
> of Regexp matching in PostgreSQL? Can it use indexes? My guess is:
> no, because I don't see a way of generally indexing to match regexp :(
> , so, tablescans for this huge dataset.....

Yes, it can index such things, but it has to index them in a fixed way, i.e. you can create functional indexes with pre-built regexes. But for ones where the values change each time, you're correct: no indexes will be used. Could full text searching be used instead?
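One way to read "pre-built regexes" is a partial index whose predicate bakes in a fixed pattern; a minimal sketch, assuming a hypothetical `records` table and pattern:

```sql
-- Index only the rows matching one fixed, known-in-advance regexp.
CREATE INDEX records_err_idx ON records (ts)
WHERE msg ~ '^ERROR: .*timeout';

-- A query repeating the exact same predicate can use the partial
-- index instead of scanning the whole table.
SELECT ts, msg
FROM records
WHERE msg ~ '^ERROR: .*timeout'
  AND ts > now() - interval '1 day';
```

This only helps when the pattern is known at index-creation time; an ad-hoc regexp supplied at query time still forces a sequential scan.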
Re: Question about POSIX Regular Expressions performance on large dataset.
From: Jose Ildefonso Camargo Tolosa
Hi, again,

I just had this wacky idea and wanted to share it: what do you think of having the dataset divided among several servers, sending the query to all of them, and then just having the application "unify" the results from all the servers?

Would that work for this kind of *one table* search? (There are no joins, and never will be.) I think it should, but: what do you think?

Ildefonso.

On Tue, Aug 17, 2010 at 9:51 PM, Jose Ildefonso Camargo Tolosa <ildefonso.camargo@gmail.com> wrote:
> I'm analyzing the possibility of using PostgreSQL to store a huge
> amount of data (around 1000M records, or so....), and these, even
> though are short (each record just have a timestamp, and a string that
> is less than 128 characters in length), the strings will be matched
> against POSIX Regular Expressions (different regexps, and maybe
> complex).
You can do something similar on the same machine if you can come up with a common way to partition your data. Then you split your 1B rows up into chunks of 10M or so, put each in its own table, and hit the right table. You can use partitioning / table inheritance if you want to, or just know the table name ahead of time. We did something similar with mnoGoSearch: we break it up into a few hundred different schemas and hit the one for a particular site, to keep the individual mnoGoSearch tables small and fast.

On Tue, Aug 17, 2010 at 8:30 PM, Jose Ildefonso Camargo Tolosa <ildefonso.camargo@gmail.com> wrote:
> what do you think of having the dataset divided among several servers,
> and sending the query to all of them, and then just have the
> application "unify" the results from all the servers?
>
> --
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql

--
To understand recursion, one must first understand recursion.
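Table inheritance with CHECK constraints is the classic way to express the partitioning described above; a minimal sketch, with hypothetical table names and date ranges:

```sql
-- Parent table; each child holds one time range.
CREATE TABLE records (
    ts  timestamptz  NOT NULL,
    msg varchar(128) NOT NULL
);

CREATE TABLE records_2010_07 (
    CHECK (ts >= '2010-07-01' AND ts < '2010-08-01')
) INHERITS (records);

CREATE TABLE records_2010_08 (
    CHECK (ts >= '2010-08-01' AND ts < '2010-09-01')
) INHERITS (records);

-- With constraint_exclusion enabled, a query constrained on ts
-- only scans the children whose CHECK range can match.
SET constraint_exclusion = on;
SELECT ts, msg
FROM records
WHERE ts >= '2010-08-01' AND ts < '2010-08-15'
  AND msg ~ 'some regexp';
```

The regexp match still scans every row of the selected partitions, but pruning keeps each scan down to ~10M rows instead of 1B.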
Re: Question about POSIX Regular Expressions performance on large dataset.
From: Sergey Konoplev
On 18 August 2010 06:30, Jose Ildefonso Camargo Tolosa <ildefonso.camargo@gmail.com> wrote:
> what do you think of having the dataset divided among several servers,
> and sending the query to all of them, and then just have the
> application "unify" the results from all the servers?
>
> Would that work for this kind of *one table* search? (there are no
> joins, and will never be). I think it should, but: what do you think?

There is a tool for this: http://plproxy.projects.postgresql.org/doc/tutorial.html

--
Sergey Konoplev
Blog: http://gray-hemp.blogspot.com / Linkedin: http://ru.linkedin.com/in/grayhemp / JID/GTalk: gray.ru@gmail.com / Skype: gray-hemp / ICQ: 29353802
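With PL/Proxy, the "send to all servers and unify" step is expressed as a proxy function that runs on every partition; a rough sketch, assuming a cluster named `recordcluster` has been configured and that each partition database defines a matching `search_records` function:

```sql
-- Proxy function on the frontend database. PL/Proxy forwards the
-- call to every partition in the cluster (RUN ON ALL) and
-- concatenates the result sets, so the application-side "unify"
-- step comes for free.
CREATE FUNCTION search_records(
    pattern text,
    OUT ts  timestamptz,
    OUT msg text
) RETURNS SETOF record AS $$
    CLUSTER 'recordcluster';
    RUN ON ALL;
$$ LANGUAGE plproxy;

-- Called like an ordinary set-returning function:
-- SELECT * FROM search_records('^(foo|bar)[0-9]+');
```

Each partition still does its own sequential regexp scan, but the scans run in parallel across the servers.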