Thread: repeated characters in SQL

repeated characters in SQL

From

Govind Chettiar

Date:

23 January 2016, 23:44:36

I have a simple table consisting of a bunch of English words. I am trying to find words that have repeated characters in them, for example

apple

tattoo

but not

orange

lemon

I know that only a maximum of one repetition can occur

I tried various options like

SELECT word FROM public."SpellItWords"

WHERE word ~ E'(.)\1{2,}'

SELECT word FROM public."SpellItWords"

WHERE word ~ E'([a-z])\1{2}'

What finally worked was this

SELECT word FROM public."SpellItWords"

WHERE word ~ E'(.)\\1'

But I don't really understand what this does...Can you explain?

Thanks!

Re: repeated characters in SQL

From

David Rowley

Date:

24 January 2016, 07:33:08

On 24 January 2016 at 12:44, Govind Chettiar <rashapoo@gmail.com> wrote:
> I have a simple table consisting of a bunch of English words.  I am trying
> to find words that have repeated characters in them, for example
> apple
> tattoo
>
> but not
>
> orange
> lemon
>
> I know that only a maximum of one repetition can occur
>
> I tried various options like
> SELECT word FROM  public."SpellItWords"
>  WHERE word ~ E'(.)\1{2,}'
>
> SELECT word FROM  public."SpellItWords"
>  WHERE word ~ E'([a-z])\1{2}'
>
> What finally worked was this
> SELECT word FROM  public."SpellItWords"
>  WHERE word ~ E'(.)\\1'
>
> But I don't really understand what this does...Can you explain?

The ~ operator is a regular expression matching operator, and the
(.)\1 is a regular expression. More details here
http://www.postgresql.org/docs/current/static/functions-matching.html

The regular expression . matches a single character, since that . is
wrapped in () the regex engine captures the match and stores it in a
variable, this is called a capture group. Since this is the first such
capture group in the regular expression, then the value matching the .
gets stored in the variable \1, so your regex basically says; "match a
single character which has the same single character to its immediate
right hand side". The extra \ is just an escape character.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: repeated characters in SQL

From

rashapoo@gmail.com

Date:

24 January 2016, 18:06:09

Thanks David...so it's looking at each character, storing it in /1, then comparing the "next" character with what is in /1.
I guess the escape character (which is not needed in, say, Notepad++) threw me a bit.

On Sun, Jan 24, 2016 at 2:32 AM, David Rowley <david.rowley@2ndquadrant.com> wrote:

On 24 January 2016 at 12:44, Govind Chettiar <rashapoo@gmail.com> wrote:
> I have a simple table consisting of a bunch of English words. I am trying
> to find words that have repeated characters in them, for example
> apple
> tattoo
>
> but not
>
> orange
> lemon
>
> I know that only a maximum of one repetition can occur
>
> I tried various options like
> SELECT word FROM public."SpellItWords"
> WHERE word ~ E'(.)\1{2,}'
>
> SELECT word FROM public."SpellItWords"
> WHERE word ~ E'([a-z])\1{2}'
>
> What finally worked was this
> SELECT word FROM public."SpellItWords"
> WHERE word ~ E'(.)\\1'
>
> But I don't really understand what this does...Can you explain?

The ~ operator is a regular expression matching operator, and the
(.)\1 is a regular expression. More details here
http://www.postgresql.org/docs/current/static/functions-matching.html

The regular expression . matches a single character, since that . is
wrapped in () the regex engine captures the match and stores it in a
variable, this is called a capture group. Since this is the first such
capture group in the regular expression, then the value matching the .
gets stored in the variable \1, so your regex basically says; "match a
single character which has the same single character to its immediate
right hand side". The extra \ is just an escape character.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: repeated characters in SQL

From

Francisco Olarte

Date:

24 January 2016, 18:31:58

On Sun, Jan 24, 2016 at 7:05 PM,  <rashapoo@gmail.com> wrote:
> I guess the escape character (which is not needed in, say, Notepad++) threw
> me a bit.

Notepad ++ is, AFAIK,  an editor, it SHOULD (within reason) let you
write any text.

The double quote is needed due to the quoting rules of the language.
You want the regexp engine to see the characters leftp, dot, rightp,
backslash, one. But backslah is the scape character in strings ( in
many languages ), so you need to escape it too.

The same happens in, for example, C and Java. To put those five chars
in a string you need "(.)\\1". If you send "(.)\1" to a C compiler it
will build the string leftp, dot, rightp, SOH=(char)(1). It will
arrive to the backslash when parsing, see it is followed by a digit
less than 8, interpret it as an octal escape, and emit the SOH.

Francisco Olarte.

Re: repeated characters in SQL

From

Tom Lane

Date:

24 January 2016, 18:41:20

Francisco Olarte <folarte@peoplecall.com> writes:
> On Sun, Jan 24, 2016 at 7:05 PM,  <rashapoo@gmail.com> wrote:
>> I guess the escape character (which is not needed in, say, Notepad++) threw
>> me a bit.

> Notepad ++ is, AFAIK,  an editor, it SHOULD (within reason) let you
> write any text.

> The double quote is needed due to the quoting rules of the language.
> You want the regexp engine to see the characters leftp, dot, rightp,
> backslash, one. But backslah is the scape character in strings ( in
> many languages ), so you need to escape it too.

In this particular case, the extra backslash was needed only because
the OP used E'...' syntax for his string literal.  In a plain SQL
string literal, backslash isn't special.

            regards, tom lane