Thread: repeated characters in SQL
I have a simple table consisting of a bunch of English words. I am trying to find words that have repeated characters in them, for example
apple
tattoo
but not
orange
lemon
I know that only a maximum of one repetition can occur
I tried various options like
SELECT word FROM public."SpellItWords"
WHERE word ~ E'(.)\1{2,}'
SELECT word FROM public."SpellItWords"
WHERE word ~ E'([a-z])\1{2}'
What finally worked was this
SELECT word FROM public."SpellItWords"
WHERE word ~ E'(.)\\1'
But I don't really understand what this does...Can you explain?
Thanks!
On 24 January 2016 at 12:44, Govind Chettiar <rashapoo@gmail.com> wrote: > I have a simple table consisting of a bunch of English words. I am trying > to find words that have repeated characters in them, for example > apple > tattoo > > but not > > orange > lemon > > I know that only a maximum of one repetition can occur > > I tried various options like > SELECT word FROM public."SpellItWords" > WHERE word ~ E'(.)\1{2,}' > > SELECT word FROM public."SpellItWords" > WHERE word ~ E'([a-z])\1{2}' > > What finally worked was this > SELECT word FROM public."SpellItWords" > WHERE word ~ E'(.)\\1' > > But I don't really understand what this does...Can you explain? The ~ operator is a regular expression matching operator, and the (.)\1 is a regular expression. More details here http://www.postgresql.org/docs/current/static/functions-matching.html The regular expression . matches a single character, since that . is wrapped in () the regex engine captures the match and stores it in a variable, this is called a capture group. Since this is the first such capture group in the regular expression, then the value matching the . gets stored in the variable \1, so your regex basically says; "match a single character which has the same single character to its immediate right hand side". The extra \ is just an escape character. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Thanks David...so it's looking at each character, storing it in /1, then comparing the "next" character with what is in /1.
I guess the escape character (which is not needed in, say, Notepad++) threw me a bit.
I guess the escape character (which is not needed in, say, Notepad++) threw me a bit.
On Sun, Jan 24, 2016 at 2:32 AM, David Rowley <david.rowley@2ndquadrant.com> wrote:
On 24 January 2016 at 12:44, Govind Chettiar <rashapoo@gmail.com> wrote:
> I have a simple table consisting of a bunch of English words. I am trying
> to find words that have repeated characters in them, for example
> apple
> tattoo
>
> but not
>
> orange
> lemon
>
> I know that only a maximum of one repetition can occur
>
> I tried various options like
> SELECT word FROM public."SpellItWords"
> WHERE word ~ E'(.)\1{2,}'
>
> SELECT word FROM public."SpellItWords"
> WHERE word ~ E'([a-z])\1{2}'
>
> What finally worked was this
> SELECT word FROM public."SpellItWords"
> WHERE word ~ E'(.)\\1'
>
> But I don't really understand what this does...Can you explain?
The ~ operator is a regular expression matching operator, and the
(.)\1 is a regular expression. More details here
http://www.postgresql.org/docs/current/static/functions-matching.html
The regular expression . matches a single character, since that . is
wrapped in () the regex engine captures the match and stores it in a
variable, this is called a capture group. Since this is the first such
capture group in the regular expression, then the value matching the .
gets stored in the variable \1, so your regex basically says; "match a
single character which has the same single character to its immediate
right hand side". The extra \ is just an escape character.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Jan 24, 2016 at 7:05 PM, <rashapoo@gmail.com> wrote: > I guess the escape character (which is not needed in, say, Notepad++) threw > me a bit. Notepad ++ is, AFAIK, an editor, it SHOULD (within reason) let you write any text. The double quote is needed due to the quoting rules of the language. You want the regexp engine to see the characters leftp, dot, rightp, backslash, one. But backslah is the scape character in strings ( in many languages ), so you need to escape it too. The same happens in, for example, C and Java. To put those five chars in a string you need "(.)\\1". If you send "(.)\1" to a C compiler it will build the string leftp, dot, rightp, SOH=(char)(1). It will arrive to the backslash when parsing, see it is followed by a digit less than 8, interpret it as an octal escape, and emit the SOH. Francisco Olarte.
Francisco Olarte <folarte@peoplecall.com> writes: > On Sun, Jan 24, 2016 at 7:05 PM, <rashapoo@gmail.com> wrote: >> I guess the escape character (which is not needed in, say, Notepad++) threw >> me a bit. > Notepad ++ is, AFAIK, an editor, it SHOULD (within reason) let you > write any text. > The double quote is needed due to the quoting rules of the language. > You want the regexp engine to see the characters leftp, dot, rightp, > backslash, one. But backslah is the scape character in strings ( in > many languages ), so you need to escape it too. In this particular case, the extra backslash was needed only because the OP used E'...' syntax for his string literal. In a plain SQL string literal, backslash isn't special. regards, tom lane