Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker) - Mailing list pgsql-patches

From Tom Lane
Subject Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
Date
Msg-id 26072.1220739661@sss.pgh.pa.us
Whole thread Raw
In response to Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)  ("David Rowley" <dgrowley@gmail.com>)
Responses Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-patches
"David Rowley" <dgrowley@gmail.com> writes:
> I've made and attached the changes Heikki recommended.

I looked this over a bit and was immediately confused by one thing:
the introductory comment says that the skip table size ought to be based
on the length of the haystack, which makes sense to me, but the code is
actually initializing it on the basis of len2, ie, the length of the
needle.  Isn't that a bug?  Was the same bug present in the tests you
made to determine the best table sizes?

Stylistic issues:

I really dislike having two copies of the heuristic size-setting code.
I think it's worth introducing a second test of use_wchar in order to
arrange text_position_setup like this:

    ... existing code ...

    choose skiptablesize

    initialize skip table (this loop doesn't depend on char width)

    if (!state->use_wchar)
        process char needle
    else
        process wide-char needle

Also it might be worth having local-variable copies of skiptablesize and
len2 for this initialization loop:

    for (ai = 0; ai <= state->skiptablesize; ai++)
        state->skiptable[ai] = state->len2;

I'm not at all sure whether a C compiler will think that it's allowed to
avoid fetching these values again on each iteration, seeing that the
loop is storing into other integer members of *state.  Given that this
loop is exactly the overhead we're trying to control by adjusting
skiptablesize, making it as tight as possible seems worthwhile.

> Now that the skip table is a member of TextPositionState, I was not quite
> sure if I should #define a size for it. It would certainly look neater, only
> the code that defines the skip table size in text_position_start assumes 256
> elements.

I tend to agree that a #define is pointless there, since there's no easy
way to make the relevant code do anything with it.  It would be
worthwhile to point out adjacent to the code what the max length is,
though.  I had gotten as far as rewriting your introductory comment like
this

    /*
     * Prepare the skip table for Boyer-Moore-Horspool searching.  First we
     * must determine how big a skip table to use.  The declaration of
     * TextPositionState allows up to 256 elements, but with haystack lengths
     * of only a few characters we don't really want to have to initialize so
     * many elements --- it would take too long in comparison to the actual
     * search time.  So we choose a useful skip table size based on the
     * haystack length.

before noticing that what I was writing wasn't what the code actually
did.  Another point that doesn't seem to be included in your comments is
that the skip table size *must* be a power of 2 because we use bit
masking.  (In fact state->skiptablesize might better be named skiptablemask
since it's really 1 less than the table size.)

BTW, try to avoid introducing so much random vertical space.  IMHO it
does little for readability and much to make routines too long to fit in
one screenful.  Generally the PG code doesn't use double blank lines
anywhere inside a function, nor blank lines before a right brace line.
pg_indent will clean this up to some extent, but it would be better if
the code looked more like the project style in the first place.

            regards, tom lane

pgsql-patches by date:

Previous
From: Tom Lane
Date:
Subject: Re: [PgFoundry] Unsigned Data Types [1 of 2]
Next
From: Tom Lane
Date:
Subject: Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)