Thread: regex cache

regex cache

From: Josh Berkus
Folks,

I'm doing some analysis of PostgreSQL site traffic, and am being frequently 
hung up by the compile-time-fixed size of our regex cache (32 regexes, per 
MAX_CACHED_RES).  Is there a reason why it would be hard to use work_mem 
or some other dynamically changeable limit for regex caching?  
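
(For context, the cache is a small static array in src/backend/utils/adt/regexp.c.
Paraphrased below -- field names and details may differ slightly by version:)

    /* Paraphrased from src/backend/utils/adt/regexp.c -- not exact code */
    #define MAX_CACHED_RES  32          /* the compile-time cap at issue */

    typedef struct cached_re_str
    {
        char       *cre_pat;            /* original RE (not null terminated) */
        int         cre_pat_len;        /* length of original RE, in bytes */
        int         cre_flags;          /* compile flags: extended, icase, etc. */
        regex_t     cre_re;             /* the compiled regular expression */
    } cached_re_str;

    static int           num_res = 0;                 /* # of currently cached REs */
    static cached_re_str re_array[MAX_CACHED_RES];    /* most recently used first */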

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: regex cache

From: Tom Lane
Josh Berkus <josh@agliodbs.com> writes:
> I'm doing some analysis of PostgreSQL site traffic, and am being frequently 
> hung up by the compile-time-fixed size of our regex cache (32 regexes, per 
> MAX_CACHED_RES).  Is there a reason why it would be hard to use work_mem 
> or some other dynamically changeable limit for regex caching?  

Hmmm ... Spencer's regex library makes a point of hiding its internal
representation of a compiled regex from the calling code.  So measuring
the size of the regex cache in bytes would involve doing a lot of
violence to that API.  We could certainly allow the size of the cache
measured in number-of-regexes to be controlled, though.
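
(A sketch of why: the caller-visible regex_t -- paraphrased from
src/include/regex/regex.h, so details may vary -- is mostly opaque pointers,
which leaves no supported way to ask how many bytes a compiled RE occupies:)

    /* Paraphrased from src/include/regex/regex.h */
    typedef struct
    {
        int         re_magic;       /* magic number */
        size_t      re_nsub;        /* number of subexpressions */
        long        re_info;        /* bitmask of information about the RE */
        int         re_csize;       /* sizeof(character) */
        char       *re_endp;        /* backward-compatibility kludge */
        /* the rest is deliberately opaque to callers: */
        char       *re_guts;        /* hidden innards; size unknowable here */
        char       *re_fns;         /* hidden function pointers */
    } regex_t;

Counting entries, by contrast, never has to look inside re_guts at all.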

Having said that, I'm not sure it'd help your problem.  If your query is
using more than 32 regexes concurrently, it likely is using $BIGNUM
regexes concurrently.  How do we fix that?
        regards, tom lane


Re: regex cache

From: Josh Berkus
Tom,

> Having said that, I'm not sure it'd help your problem.  If your query is
> using more than 32 regexes concurrently, it likely is using $BIGNUM
> regexes concurrently.  How do we fix that?

Hmmm.  I think there's a lot of ground between 32 and $BIGNUM.  For example, 
where I'm hitting a wall is 300 regexes.  Some quick testing on my Opteron 
test machine right now shows that the execution time difference between 
20 REs and 50 REs is around 20x.

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: regex cache

From: Tom Lane
Josh Berkus <josh@agliodbs.com> writes:
>> Having said that, I'm not sure it'd help your problem.  If your query is
>> using more than 32 regexes concurrently, it likely is using $BIGNUM
>> regexes concurrently.  How do we fix that?

> Hmmm.  I think there's a lot of ground between 32 and $BIGNUM.  For example, 
> where I'm hitting a wall is 300 regexes.  Some quick testing on my Opteron 
> test machine right now shows that the execution time difference between 
> 20 REs and 50 REs is around 20x.

Hmm.  Well, I still don't want to tie it to work_mem; how do you feel
about a new GUC to determine the max number of cached REs?
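
(Hypothetically, that could be one more entry in guc.c's ConfigureNamesInt
table.  The name regex_cache_size, its limits, and its category below are
illustrative only, following the 8.3-era struct config_int layout:)

    /* Hypothetical sketch for src/backend/utils/misc/guc.c */
    static int  regex_cache_size = 32;  /* would replace uses of MAX_CACHED_RES */

    /* ...one more entry in the ConfigureNamesInt[] table: */
    {
        {"regex_cache_size", PGC_USERSET, QUERY_TUNING_OTHER,
            gettext_noop("Sets the maximum number of compiled regular "
                         "expressions cached per backend."),
            NULL
        },
        &regex_cache_size,
        32, 1, 1000,            /* boot value, min, max */
        NULL, NULL              /* assign_hook, show_hook */
    },

That would at least let the limit be raised past a query's working set; once
the number of distinct REs in use exceeds the cache size, every evaluation
recompiles, which is presumably the cliff you're seeing between 20 and 50.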
        regards, tom lane


Re: regex cache

From: Josh Berkus
Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>>> Having said that, I'm not sure it'd help your problem.  If your query is
>>> using more than 32 regexes concurrently, it likely is using $BIGNUM
>>> regexes concurrently.  How do we fix that?
> 
>> Hmmm.  I think there's a lot of ground between 32 and $BIGNUM.  For example, 
>> where I'm hitting a wall is 300 regexes.  Some quick testing on my Opteron 
>> test machine right now shows that the execution time difference between 
>> 20 REs and 50 REs is around 20x.
> 
> Hmm.  Well, I still don't want to tie it to work_mem; how do you feel
> about a new GUC to determine the max number of cached REs?

Yeah.  You know me, I was just trying to avoid having more GUCs.

--Josh


Re: regex cache

From: Tom Lane
Josh Berkus <josh@agliodbs.com> writes:
> Tom Lane wrote:
>> Hmm.  Well, I still don't want to tie it to work_mem; how do you feel
>> about a new GUC to determine the max number of cached REs?

> Yeah.  You know me, I was just trying to avoid having more GUCs.

I'm not excited about it either, but I think if we're going to make
this adjustable it does need its own knob.  I can easily believe
that a large list of precompiled REs could be counterproductive
given a workload where you don't get much reuse, so I don't want
the list size going to the moon just because someone cranked up
work_mem for other purposes.

(I'm not real sure that that "self-organizing list" data structure
would work well beyond 1000 or so entries even if you did have
enough re-use to justify them all.  Anyone want to try to do some
performance testing?  In particular I think we might want to drop
the move-to-front approach in favor of move-up-one, just to avoid
O(N^2) memmove costs.)
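
(For concreteness, a minimal sketch of the two policies, reusing the
re_array layout sketched earlier in the thread; illustrative only, not
actual patch code.  Needs <string.h> for memmove.)

    /* move-to-front: a hit at position i shifts i entries down -- O(i) */
    static void
    promote_to_front(int i)
    {
        cached_re_str   tmp = re_array[i];

        memmove(re_array + 1, re_array, i * sizeof(cached_re_str));
        re_array[0] = tmp;
    }

    /* move-up-one: a hit swaps with its predecessor -- O(1) */
    static void
    promote_up_one(int i)
    {
        cached_re_str   tmp;

        if (i == 0)
            return;                 /* already at the front */
        tmp = re_array[i - 1];
        re_array[i - 1] = re_array[i];
        re_array[i] = tmp;
    }

With N cached REs and hits landing mostly deep in the list, move-to-front
costs O(N) entry moves per hit, hence O(N^2) in aggregate; move-up-one still
drifts hot entries toward the front, but at only one swap per hit.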
        regards, tom lane


Re: regex cache

From: Josh Berkus
Tom,

> I'm not excited about it either, but I think if we're going to make
> this adjustable it does need its own knob.  I can easily believe
> that a large list of precompiled REs could be counterproductive
> given a workload where you don't get much reuse, so I don't want
> the list size going to the moon just because someone cranked up
> work_mem for other purposes.

Yes.  I was just trying to avoid thinking about it.  ;-)

> 
> (I'm not real sure that that "self-organizing list" data structure
> would work well beyond 1000 or so entries even if you did have
> enough re-use to justify them all.  Anyone want to try to do some
> performance testing?  In particular I think we might want to drop
> the move-to-front approach in favor of move-up-one, just to avoid
> O(N^2) memmove costs.)

Hmmm.  Yeah, I can see that.

Well, I have a test case here (the PostgreSQL download logs), or I 
wouldn't have brought up the issue.  I just need to find a way to 
multi-thread it so I can get the effect of multiple clients.
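
(One hypothetical way to fake multiple clients: a libpq + pthreads driver,
one connection per thread.  The dbname, table, and pattern below are
placeholders; pgbench with a custom script file would also do the job.
Note the RE cache is per-backend, so each connection warms its own copy.)

    /* Hypothetical load driver: build with  cc driver.c -lpq -lpthread */
    #include <stdio.h>
    #include <pthread.h>
    #include <libpq-fe.h>

    #define NCLIENTS 8

    static void *
    worker(void *arg)
    {
        PGconn     *conn = PQconnectdb("dbname=logs");  /* placeholder dbname */
        int         i;

        (void) arg;
        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "%s", PQerrorMessage(conn));
            PQfinish(conn);
            return NULL;
        }
        for (i = 0; i < 100; i++)
        {
            /* stand-in for the real regex-heavy analysis query */
            PGresult *res = PQexec(conn,
                "SELECT count(*) FROM downloads WHERE path ~ '^beta[0-9]+/'");

            PQclear(res);
        }
        PQfinish(conn);
        return NULL;
    }

    int
    main(void)
    {
        pthread_t   threads[NCLIENTS];
        int         i;

        for (i = 0; i < NCLIENTS; i++)
            pthread_create(&threads[i], NULL, worker, NULL);
        for (i = 0; i < NCLIENTS; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }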

--Josh