Re: broken table formatting in psql - Mailing list pgsql-hackers

From John Naylor
Subject Re: broken table formatting in psql
Msg-id CAFBsxsHU91b0FDevdO=JugYHMhBMym6k94aa-iqwtjPLFU5axA@mail.gmail.com
In response to Re: broken table formatting in psql  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: broken table formatting in psql
List pgsql-hackers
On Fri, Sep 2, 2022 at 12:17 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Thu, 01 Sep 2022 18:22:06 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > At Thu, 1 Sep 2022 15:00:38 +0700, John Naylor <john.naylor@enterprisedb.com> wrote in
> > > UnicodeData.txt has this:
> > >
> > > 200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
> > > 200C;ZERO WIDTH NON-JOINER;Cf;0;BN;;;;;N;;;;;
> > > 200D;ZERO WIDTH JOINER;Cf;0;BN;;;;;N;;;;;
> > > 200E;LEFT-TO-RIGHT MARK;Cf;0;L;;;;;N;;;;;
> > > 200F;RIGHT-TO-LEFT MARK;Cf;0;R;;;;;N;;;;;
> > >
> > > So maybe we need to take Cf characters in this file into account, in
> > > addition to Me and Mn (combining characters).
> >
> > Including them into unicode_combining_table.h actually worked, but I'm
> > not sure it is valid to include Cf's among Mn/Me's..

Looking at the definition, Cf is the "other, format" category: "Format
character that affects the layout of text or the operation of text
processes, but is not normally rendered". [1]
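
As a quick cross-check of the categories quoted in this thread, here is a minimal sketch using Python's standard unicodedata module. It is not part of PostgreSQL; the list of code points and the helper name general_category are just for illustration, and the result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Code points mentioned in this thread (illustrative list, not exhaustive).
CODEPOINTS = [0x200B, 0x200C, 0x200D, 0x200E, 0x200F, 0x00AD, 0x0600, 0x110BD]


def general_category(cp: int) -> str:
    """Return the two-letter Unicode general category of a code point."""
    return unicodedata.category(chr(cp))


if __name__ == "__main__":
    for cp in CODEPOINTS:
        name = unicodedata.name(chr(cp), "<unnamed>")
        print(f"U+{cp:04X} {general_category(cp)} {name}")
```

All of these are in category Cf, even though some of them (U+00AD, U+0600, U+110BD) are not obviously zero-width.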

> UnicodeData.txt
>     174:00AD;SOFT HYPHEN;Cf;0;BN;;;;;N;;;;;
>
> Soft-hyphen seems like not zero-width.. usually...

I gather it is only rendered when a line is broken at that point, which
I doubt we want to handle.

>  0600;ARABIC NUMBER SIGN;Cf;0;AN;;;;;N;;;;;
> 110BD;KAITHI NUMBER SIGN;Cf;0;L;;;;;N;;;;;
>
> Mmm. These looks like not zero-width?

There are glyphs, but there is something special about the first one:

select U&'\0600';

Looks like this in psql (substituting 'X' to avoid rendering differences
between systems):

+----------+
| ?column? |
+----------+
| X       |
+----------+
(1 row)

Copy from psql to vim or nano:

+----------+
| ?column? |
+----------+
| X        |
+----------+
(1 row)

...so it does mess up the border the same way. The second
(U&'\+0110bd') doesn't render for me.

> However, it seems like basically a win if we include "Cf"s to the
> "combining" table?

There seems to be a case for that. If we did include those, we should
rename the table to match.
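
To get a feel for what such a table would cover, here is a rough sketch in Python that collapses code points of the chosen categories into inclusive ranges. This is not the script that generates unicode_combining_table.h; the function names zero_width_ranges and covers are made up for this example, and the output depends on the Unicode version of the Python build.

```python
import sys
import unicodedata


def zero_width_ranges(categories=("Mn", "Me", "Cf")):
    """Collapse all code points in the given general categories
    into a sorted list of inclusive (first, last) ranges."""
    ranges = []
    start = prev = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) in categories:
            if start is None:
                start = cp  # open a new range
            prev = cp
        elif start is not None:
            ranges.append((start, prev))  # close the current range
            start = None
    if start is not None:
        ranges.append((start, prev))
    return ranges


def covers(ranges, cp):
    """True if the code point falls inside any of the ranges."""
    return any(first <= cp <= last for first, last in ranges)


if __name__ == "__main__":
    for first, last in zero_width_ranges():
        print(f"{{0x{first:04X}, 0x{last:04X}}},")
```

Passing categories=("Mn", "Me") gives the combining-only view, so the two lists can be compared. Note that including all of Cf also sweeps in U+00AD and U+0600, the cases questioned above.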

I found this old document from 2002 on "default ignorable" characters
that normally have no visible glyph:

https://unicode.org/L2/L2002/02368-default-ignorable.html

If there is any doubt about including all of Cf, we could also just
add a branch in wchar.c to hard-code the 200B-200F range.
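
The hard-coded alternative could look roughly like the following, sketched in Python rather than C. This is not the actual code in wchar.c; display_width and string_width are hypothetical names, and the fallback logic (category and East Asian width lookups from the unicodedata module) is a simplification that ignores control characters.

```python
import unicodedata


def display_width(cp: int) -> int:
    """Hypothetical width lookup: 0 for zero-width, 2 for wide, else 1."""
    # Hard-coded branch for U+200B..U+200F (zero-width and directional marks).
    if 0x200B <= cp <= 0x200F:
        return 0
    ch = chr(cp)
    # Combining characters (Mn, Me) take no column of their own.
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0
    # East Asian Wide and Fullwidth characters take two columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def string_width(s: str) -> int:
    """Sum of per-character display widths."""
    return sum(display_width(ord(c)) for c in s)
```

With this approach the other Cf characters, such as U+0600, keep width 1, so only the five code points from the original report change behavior.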

-- 
John Naylor
EDB: http://www.enterprisedb.com
