8908 regcomp: reduce size of bitmap for multibyte locales

Review Request #2342 — Created Sept. 21, 2019 and submitted — Latest diff uploaded


This fixes an obscure endless loop seen with case-insensitive
patterns containing characters in the 128-255 range; it was originally
found while running the GNU grep test suite.

Our regex implementation is kludgy: it translates each character of a
case-insensitive pattern into a bracket expression containing both
cases of that character. It does not correctly handle the case where
the original character is in the bitmap and its other case is not, and
falls into an endless loop through p_bracket(), ordinary(),
and bothcases().
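
As a rough illustration of that translation, a single pattern character can be expanded into a bracket expression holding both of its cases. This is a shell analogy only, not the libc code: the helper name both_cases is made up for this sketch, and it is pinned to the C locale so that only ASCII letters have a second case.

```shell
# Hypothetical helper: expand one character into a bracket expression
# holding both of its cases, roughly the idea the text attributes to
# bothcases(). Analogy only; this is not how libc implements it.
both_cases() {
    ch=$1
    upper=$(printf '%s' "$ch" | LC_ALL=C tr '[:lower:]' '[:upper:]')
    lower=$(printf '%s' "$ch" | LC_ALL=C tr '[:upper:]' '[:lower:]')
    if [ "$upper" = "$lower" ]; then
        # No second case: the character stands for itself.
        printf '%s\n' "$ch"
    else
        printf '[%s%s]\n' "$upper" "$lower"
    fi
}

both_cases a    # prints [Aa]
both_cases 7    # prints 7
```

The trouble described above starts when the two members of such a pair do not land on the same side of the bitmap boundary.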

Reducing the bitmap to the 0-127 range for multibyte locales solves this,
as none of these characters has an other-case mapping outside the bitmap.
We are also safe when the original character is outside the bitmap and
its other case is inside it (our current ctype maps contain several such
characters, with a unidirectional mapping into the bitmap).
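
The boundary argument can be sketched with plain code point comparisons. The helper classify_pair below is hypothetical and only encodes the rule stated above (original character inside the bitmap, other case outside it). It assumes the previous bitmap covered 0-255, as the 128-255 range mentioned earlier suggests. The code point pairs come from the Unicode case mappings (U+00B5 to U+039C, U+212A to U+006B), not from a reading of the illumos ctype maps.

```shell
# Hypothetical helper: classify a character and its other case against a
# bitmap covering code points 0 .. limit-1.
# Arguments: original code point, other-case code point, bitmap limit.
classify_pair() {
    if [ "$1" -lt "$3" ] && [ "$2" -ge "$3" ]; then
        echo "loops"    # original in bitmap, other case outside it
    else
        echo "ok"       # every other combination
    fi
}

# U+00B5 (181) with other case U+039C (924)
classify_pair 181 924 256    # 0-255 bitmap: prints loops
classify_pair 181 924 128    # 0-127 bitmap: prints ok
# U+212A (8490) with other case U+006B (107): outside maps to inside
classify_pair 8490 107 128   # prints ok
```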

$ cat multibyte.sh
export LC_CTYPE="C.UTF-8"

a=$(printf '\302\265\n')        # U+00B5
b=$(printf '\316\234\n')        # U+039C
c=$(printf '\316\274\n')        # U+03BC

echo '----------pre----------'
echo $b | sed -ne "/$a/Ip"
echo $c | sed -ne "/$a/Ip"
echo '----------post---------'
echo $b | LD_LIBRARY_PATH=~/ws/il8908/proto/root_i386/lib/ sed -ne "/$a/Ip"
echo $c | LD_LIBRARY_PATH=~/ws/il8908/proto/root_i386/lib/ sed -ne "/$a/Ip"

$ ./multibyte.sh
./multibyte.sh: line 8: 744790 Done                    echo $b
     744791 Segmentation Fault      (core dumped) | sed -ne "/$a/Ip"
./multibyte.sh: line 9: 744792 Done                    echo $c
     744793 Segmentation Fault      (core dumped) | sed -ne "/$a/Ip"
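
The code points noted in the script comments can be checked by decoding the octal byte pairs passed to printf. The sketch below uses a made-up helper, utf8_2byte_to_cp, and handles only two-byte UTF-8 sequences, which is all the test script needs.

```shell
# Hypothetical helper: decode a two-byte UTF-8 sequence, given as two
# octal byte values with a leading 0, into its code point.
utf8_2byte_to_cp() {
    echo $(( (($1 & 0x1F) << 6) | ($2 & 0x3F) ))
}

for bytes in "0302 0265" "0316 0234" "0316 0274"; do
    # Intentional word splitting: each item holds the two byte values.
    set -- $bytes
    printf 'U+%04X\n' "$(utf8_2byte_to_cp "$1" "$2")"
done
# prints U+00B5, U+039C and U+03BC, one per line
```

Only the first of the three falls below 256, so it is the only one that a 0-255 bitmap would hold.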