Thread: Re: tsearch parser inefficiency if text includes urls or emails - new version
Re: tsearch parser inefficiency if text includes urls or emails - new version
From
"Kevin Grittner"
Date:
Andres Freund wrote: On Saturday 14 November 2009 01:03:33 Kevin Grittner wrote: >> It is in context format, applies cleanly, and passes "make check". > Unfortunately the latter is not saying much - I had a bug there and > it was not found by the regression tests. Perhaps I should take a > stab and add at least some more... Sounds like a good idea. The one thing to avoid is anything with a long enough run time to annoy those that run it many times a day. >> It is in context format, applies cleanly, and passes "make check". >> Next I read through the code, and have the same question that >> Andres posed 12 days ago. His patch massively reduces the cost of >> the parser recursively calling itself for some cases, and it seems >> like the least invasive way to modify the parser to solve this >> performance problem; but it does beg the question of why a state >> machine like this should recursively call itself when it hits >> certain states. > I was wondering about that as well. I am not completely sure but to > me it looks like its just done to reduce the amount of rules and > states. I'm assuming that's the reason, but didn't dig deep enough to be sure. I suspect to be really sure, I'd have to set it up without the recursion and see what breaks. I can't imagine it would be anything which couldn't be fixed by adding enough states; but perhaps they ran into something where these types would require so many new states that the recursion seemed like the lesser of evils. > I have to say that that code is not exactly clear and well > documented... Yeah. I was happy with the level of documentation that you added with your new code, but what was there before is mighty thin. If you gleaned enough information while working on it to feel comfortable adding documentation anywhere else, that would be a good thing. So far the only vote is to proceed with the mitigation, which was my preference, and apparently yours -- so I guess we're at 3 to 0 in favor of that. I'll mark the patch as "Waiting on Author" so you can add any comments and regression tests you feel are appropriate. By the way, I found one typo in the comments -- it should by useful, not usefull. I liked what I saw so far, but need to spend more time desk-checking for correctness, testing to confirm that it doesn't change results, and confirming the performance improvement. -Kevin
Re: tsearch parser inefficiency if text includes urls or emails - new version
From
Andres Freund
Date:
On Saturday 14 November 2009 15:33:00 Kevin Grittner wrote: > Andres Freund wrote: > > On Saturday 14 November 2009 01:03:33 Kevin Grittner wrote: > >> It is in context format, applies cleanly, and passes "make check". > > > > Unfortunately the latter is not saying much - I had a bug there and > > it was not found by the regression tests. Perhaps I should take a > > stab and add at least some more... > > Sounds like a good idea. The one thing to avoid is anything with a > long enough run time to annoy those that run it many times a day. Hm. There actually are tests excercising the part where I had a bug... Strange. It was a bug involving uninitialized data so probably the regression tests where just "lucky". > >> It is in context format, applies cleanly, and passes "make check". > >> Next I read through the code, and have the same question that > >> Andres posed 12 days ago. His patch massively reduces the cost of > >> the parser recursively calling itself for some cases, and it seems > >> like the least invasive way to modify the parser to solve this > >> performance problem; but it does beg the question of why a state > >> machine like this should recursively call itself when it hits > >> certain states. > > I was wondering about that as well. I am not completely sure but to > > me it looks like its just done to reduce the amount of rules and > > states. > I'm assuming that's the reason, but didn't dig deep enough to be sure. > I suspect to be really sure, I'd have to set it up without the > recursion and see what breaks. I can't imagine it would be anything > which couldn't be fixed by adding enough states; but perhaps they ran > into something where these types would require so many new states that > the recursion seemed like the lesser of evils. This is similar to my understanding... > > I have to say that that code is not exactly clear and well > > documented... > Yeah. I was happy with the level of documentation that you added with > your new code, but what was there before is mighty thin. If you > gleaned enough information while working on it to feel comfortable > adding documentation anywhere else, that would be a good thing. It definitely would be a good thing. But that would definitely be seperate patch. But I fear my current leel of knowledge is sufficient and also I am not sure if I can make myself interested enough in that part. > So far the only vote is to proceed with the mitigation, which was my > preference, and apparently yours -- so I guess we're at 3 to 0 in > favor of that. I'll mark the patch as "Waiting on Author" so you can > add any comments and regression tests you feel are appropriate. > > By the way, I found one typo in the comments -- it should by useful, > not usefull. Ok, will update. > I liked what I saw so far, but need to spend more time desk-checking > for correctness, testing to confirm that it doesn't change results, > and confirming the performance improvement. Thanks again for your reviewing! Andres
Re: tsearch parser inefficiency if text includes urls or emails - new version
From
"Kevin Grittner"
Date:
Andres Freund <andres@anarazel.de> wrote: > On Saturday 14 November 2009 15:33:00 Kevin Grittner wrote: >> Andres Freund wrote: >> > I had a bug there and it was not found by the regression tests. >> > Perhaps I should take a stab and add at least some more... >> >> Sounds like a good idea. > Hm. There actually are tests excercising the part where I had a > bug... Strange. It was a bug involving uninitialized data so > probably the regression tests where just "lucky". OK. I won't be looking for extra tests. >> > I have to say that that code is not exactly clear and well >> > documented... >> Yeah. I was happy with the level of documentation that you added >> with your new code, but what was there before is mighty thin. If >> you gleaned enough information while working on it to feel >> comfortable adding documentation anywhere else, that would be a >> good thing. > It definitely would be a good thing. But that would definitely be > seperate patch. But I fear my current leel of knowledge is > sufficient and also I am not sure if I can make myself interested > enough in that part. Fair enough. I won't be looking for new comments for the old code. >> By the way, I found one typo in the comments -- it should by >> useful, not usefull. > Ok, will update. Given how trivial that is, I'm putting this back in "Needs Review" status, and resuming my review work. Barring surprises, I should wrap this up whenever I can free up a two or three hours. -Kevin
After getting off to a good start, it looks like this patch is now stuck waiting for a second review pass from Kevin right now, with no open items for Andres to correct. Since the only issues on the table seem to be that of code aesthetics and long-term planning for this style of implementation rather than specific functional bits, I'm leaning toward saying this one is ready to have a committer look at it. Any comments from Kevin or Andres about where this is at? -- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
Re: tsearch parser inefficiency if text includes urls or emails - new version
From
Andres Freund
Date:
On Monday 07 December 2009 02:12:43 Greg Smith wrote: > After getting off to a good start, it looks like this patch is now stuck > waiting for a second review pass from Kevin right now, with no open > items for Andres to correct. Since the only issues on the table seem to > be that of code aesthetics and long-term planning for this style of > implementation rather than specific functional bits, I'm leaning toward > saying this one is ready to have a committer look at it. Any comments > from Kevin or Andres about where this is at? I think it should be ready - the only know thing it needs is a s/usefull/useful/. I will take another look but I doubt I will see anything new. Andres
Re: tsearch parser inefficiency if text includes urls or emails - new version
From
"Kevin Grittner"
Date:
Greg Smith <greg@2ndquadrant.com> wrote: > After getting off to a good start, it looks like this patch is now > stuck waiting for a second review pass from Kevin right now, with > no open items for Andres to correct. Since the only issues on the > table seem to be that of code aesthetics and long-term planning > for this style of implementation rather than specific functional > bits, I'm leaning toward saying this one is ready to have a > committer look at it. Any comments from Kevin or Andres about > where this is at? Yeah, really the only thing I found to complain about was one misspelled word in a comment. I am currently the hold-up, due to fighting off a bout of some virus and having other "real world" issues impinge. The only thing left to do, besides correcting the spelling, is to confirm the author's performance improvements and confirm that there is no degradation in a non-targeted situation. Frankly, I'd be amazed if there was a performance regression, because all it really does is pass a pointer to a new spot in an existing input buffer rather than allocating new space and copying the input from the desired spot to the end of the buffer. I can't think of any situations where calculating the new address should be slower than calculating the new address and copying from there to the end of the buffer. -Kevin