Thank you,
Just in case: I know x86 would not use the fallback implementation; however, the whole point of the shift-based DFA is to fold all the data-dependent operations into a single instruction.
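To make that concrete, here is a toy shift DFA (my illustration only, nothing to do with the patch's UTF-8 tables). Each state is a bit offset into a 64-bit transition word, so the loop-carried work per input byte is a single shift:

#include <stdint.h>

// accept 'a'/'b' strings that never contain "bb"
#define S_OK     0   // states are bit offsets, 6 bits each
#define S_SEEN_B 6
#define S_BAD    12
#define S_MASK   63

// row for input 'a': OK->OK, SEEN_B->OK, BAD->BAD
static const uint64_t row_a = ((uint64_t) S_OK << S_OK)
                            | ((uint64_t) S_OK << S_SEEN_B)
                            | ((uint64_t) S_BAD << S_BAD);
// row for input 'b': OK->SEEN_B, SEEN_B->BAD, BAD->BAD
static const uint64_t row_b = ((uint64_t) S_SEEN_B << S_OK)
                            | ((uint64_t) S_BAD << S_SEEN_B)
                            | ((uint64_t) S_BAD << S_BAD);

static int toy_accepts(const char *s) {
    uint64_t state = S_OK;
    for (; *s; s++)
        // the state -> state dependency chain is just this one shift
        state = (*s == 'a' ? row_a : row_b) >> (state & S_MASK);
    return (state & S_MASK) != S_BAD;
}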
An alternative idea: should we optimize for validation of **valid** inputs rather than optimizing the worst case?
In other words, what if the implementation always processes all characters, and falls back to a slower method only when validation fails?
I would guess it is more important to accept valid input quickly than to reject invalid input quickly.
static inline int
pg_is_valid_utf8(const unsigned char *s, const unsigned char *end) {
    uint64 class;
    uint64 state = BGN;

    while (s < end) { // clang unrolls the loop
        class = ByteCategory[*s++];
        state = class >> (state & DFA_MASK); // <-- note that the AND is fused into the shift instruction
    }
    return (state & DFA_MASK) != ERR;
}
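As a quick sanity check (plain assert, not patch code; 0xC0 0xAF is an overlong encoding that any validator must reject):

#include <assert.h>

static void test_pg_is_valid_utf8(void) {
    const unsigned char ok[]  = "hello";
    const unsigned char bad[] = { 0xC0, 0xAF };  // overlong '/', always invalid

    assert(pg_is_valid_utf8(ok, ok + 5));
    assert(!pg_is_valid_utf8(bad, bad + 2));
}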
Note: GCC does not seem to unroll the "while (s < end)" loop by default, so a manual unroll might be worth trying:
static inline int
pg_is_valid_utf8(const unsigned char *s, const unsigned char *end) {
    uint64 class;
    uint64 state = BGN;

    while (s + 4 <= end) { // process 4 bytes per iteration while at least 4 remain
        for (int i = 0; i < 4; i++) {
            class = ByteCategory[*s++];
            state = class >> (state & DFA_MASK);
        }
    }
    while (s < end) { // tail: the remaining 0..3 bytes
        class = ByteCategory[*s++];
        state = class >> (state & DFA_MASK);
    }
    return (state & DFA_MASK) != ERR;
}
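Alternatively, the compiler can be asked to unroll (my suggestion; GCC 8+ and recent clang accept this pragma directly before the loop, older compilers would need the manual version above):

static inline int
pg_is_valid_utf8(const unsigned char *s, const unsigned char *end) {
    uint64 class;
    uint64 state = BGN;

    #pragma GCC unroll 4
    while (s < end) {
        class = ByteCategory[*s++];
        state = class >> (state & DFA_MASK);
    }
    return (state & DFA_MASK) != ERR;
}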
----
static int
pg_utf8_verifystr2(const unsigned char *s, int len) {
    if (pg_is_valid_utf8(s, s + len)) { // fast path: the whole string is valid, accept it
        return len;
    }
    // slow path: the string is invalid, rescan to find how many leading bytes are valid
    return ....;
}
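The slow path does not need anything new; it could simply be the existing byte-at-a-time loop (a sketch below; IS_HIGHBIT_SET and pg_utf8_verifychar are the existing helpers, and verifystr is expected to report the number of valid bytes):

static int
pg_utf8_verifystr_slow(const unsigned char *s, int len) {
    const unsigned char *start = s;

    while (len > 0) {
        int l;

        if (!IS_HIGHBIT_SET(*s)) {  // ASCII; an embedded NUL ends the valid prefix
            if (*s == '\0')
                break;
            l = 1;
        } else {
            l = pg_utf8_verifychar(s, len);  // -1 means an invalid sequence starts here
            if (l == -1)
                break;
        }
        s += l;
        len -= l;
    }
    return s - start;  // number of valid bytes
}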