Re: TODO item: Implement Boyer-Moore searching in LIKE queries - Mailing list pgsql-hackers

From Tom Lane
Subject Re: TODO item: Implement Boyer-Moore searching in LIKE queries
Date
Msg-id 8747.1470165235@sss.pgh.pa.us
Whole thread Raw
In response to Re: TODO item: Implement Boyer-Moore searching in LIKE queries  (Karan Sikka <karanssikka@gmail.com>)
Responses Re: TODO item: Implement Boyer-Moore searching in LIKE queries  (Karan Sikka <karanssikka@gmail.com>)
List pgsql-hackers
Karan Sikka <karanssikka@gmail.com> writes:
> Just for the record, I'm leaning to the side of "not worth it". My
> thoughts are, if you are at a scale where you care about this 20%
> speedup, you would want to go all the way to an indexed structure,
> because the gains you would realize would exceed 20%, and 20% may not be
> a sufficient speedup anyway.

If I'm reading your test case correctly, 20% is actually a rather
impressive number, because it's 20% *overall* gain on queries that will
also involve TOAST fetch and decompress on the source data.  (Decompress
definitely, and I'm guessing those 5K strings don't compress well enough
to avoid getting pushed out-of-line; though it might be worth repeating
the test with chunks of 10K or 20K to be sure.)  So the percentage
improvement in the LIKE test proper must have been a lot more than that.

However, I'm dubious that LIKE patterns with long fixed substrings are a
common use-case, so I'm afraid that this might be quite a lot of work for
something that won't much benefit most users.  I'm also worried that the
setup costs might be enough to make it a net loss in many cases.

There are probably ways to amortize the setup costs, since typical
scenarios involve the same LIKE pattern across many rows, but implementing
that would add even more work.  (Having said that, I've had a bee in my
bonnet for a long time about removing per-row setup cost for repetitive
regex matches, and whatever infrastructure that needs would work for this
too.  And for strpos' B-M-H setup, looks like.  So this might be something
to look into with a suitably wide view of what the problem is.)

Not sure what advice to give you here.  I think this is in the grey zone
where it's hard to be sure whether it's worth putting work into.
        regards, tom lane



pgsql-hackers by date:

Previous
From: Vladimir Sitnikov
Date:
Subject: Re: Slowness of extended protocol
Next
From: Chapman Flack
Date:
Subject: Re: AdvanceXLInsertBuffer vs. WAL segment compressibility