Thread: support for DIN SPEC 91379 encoding

support for DIN SPEC 91379 encoding

From
Marco Lechner
Date:

Hi,

 

Does anyone here know whether PostgreSQL supports the DIN SPEC 91379 encoding?

 

As far as I understand, it is a “new” encoding, based on Unicode, that supports all “EU characters” but is not compliant with UTF-8. As far as I have read, there are a few characters in DIN SPEC 91379 that are not within UTF-8. DIN SPEC 91379 seems to be a national specification (DE), but it is based on a European law. So I guess that similar international, or at least European, encodings should exist, or at least other national specs that are compatible with the German DIN SPEC 91379.

 

i.A. Dr. Marco Lechner

Leiter Fachgebiet RN 1 │ Head RN 1

 

--

Bundesamt für Strahlenschutz │ Federal Office for Radiation Protection

Koordination Notfallschutzsysteme │ Coordination Emergency Systems │ RN 1

Rosastr. 9

D-79098 Freiburg

 

Tel.: +49 30 18333-6724

E-Mail: mlechner@bfs.de

www.bfs.de

🌐 Visit our website, follow us on Twitter, and subscribe to our 📢 newsletter.

🔒 Data protection information pursuant to Article 13 GDPR

💚 Print this e-mail? Better to spare the environment!

--

Note on attachments ending in .p7m/.p7c/.p7s or .asc/.asc.sig:
The .p7? and .asc files are harmless signature files (digital signatures). In e-mail clients with an S/MIME configuration (.p7?) or a PGP extension (.asc) they serve to:
- verify the sender
- check whether the content was altered during transmission over the Internet
The signature files can also be used to send the sender of this signature an e-mail with encrypted content. In e-mail clients without an S/MIME configuration or a PGP extension the files appear as attachments and can be ignored.

 

Re: support for DIN SPEC 91379 encoding

From
Ralf Schuchardt
Date:
Hi Marco,

On 27 Mar 2022, at 12:54, Marco Lechner wrote:

> Hi,
>
> Does anyone here know, if postgresql supports DIN SPEC 91379 encoding?
>
> As far as I understand it is a “new” encoding supporting all “EU characters” based on Unicode, but is not compliant
> to UTF-8. As far as I did read, there are a few characters in DIN SPEC
> 91379 that are not within UTF-8.

Where did you read that DIN SPEC 91379 is incompatible with UTF-8?

The document „String.Latin+ 1.2: eine kommentierte und erweiterte Fassung der DIN SPEC 91379. Inklusive einer
umfangreichen Liste häufig gestellter Fragen. Herausgegeben von der Fachgruppe String.Latin“, linked at
https://www.xoev.de/downloads-2316#StringLatin, says that the spec is a strict subset of Unicode (E.1.6). It
also mentions in E.1.4 that UTF-8 can encode all Unicode characters. Therefore UTF-8 can be used to encode
all DIN SPEC 91379 characters.
On the other hand, UTF-8 strings may contain characters that are not included in the DIN SPEC.
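
The subset relationship described above can be sketched in Python. This is only an illustration: SAMPLE_REPERTOIRE is a tiny made-up sample, not the real DIN SPEC 91379 character list, and characters_outside is a hypothetical helper. The check is per code point, which is a simplification if the spec also lists multi-code-point sequences.

```python
# Illustrative only: a tiny made-up sample, NOT the real DIN SPEC 91379 list.
SAMPLE_REPERTOIRE = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "\u00c4\u00d6\u00dc\u00e4\u00f6\u00fc\u00df"  # Ä Ö Ü ä ö ü ß
    " -'"
)


def characters_outside(text, repertoire):
    """Return the characters of `text` that are not in `repertoire`,
    each listed once, in order of first appearance."""
    outside = []
    for ch in text:
        if ch not in repertoire and ch not in outside:
            outside.append(ch)
    return outside


# Text restricted to a subset of Unicode can always be encoded in UTF-8.
name = "J\u00fcrgen Gro\u00df"
name_is_in_sample = characters_outside(name, SAMPLE_REPERTOIRE) == []
name_round_trips = name.encode("utf-8").decode("utf-8") == name

# The converse does not hold: valid UTF-8 text may leave the subset.
extra = characters_outside("J\u00fcrgen \U0001F642", SAMPLE_REPERTOIRE)
```

An application that has to enforce the spec would load the real character list in place of the sample and reject or report whatever characters_outside returns.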

Ralf

> As DIN SPEC 91379 seems to be a national specification (DE) it is based on a European law. I guess other similar
> international or at least European compliant encodings should exist or at least other national specs that are compliant
> with the German DIN SPEC 91379.



Re: support for DIN SPEC 91379 encoding

From
Alvaro Herrera
Date:
On 2022-Mar-27, Ralf Schuchardt wrote:

> where did you read, that this DIN SPEC 91379 norm is incompatible with UTF-8?
> 
> In the document „String.Latin+ 1.2: eine kommentierte und erweiterte
> Fassung der DIN SPEC 91379. Inklusive einer umfangreichen Liste häufig
> gestellter Fragen. Herausgegeben von der Fachgruppe String.Latin“
> linked here https://www.xoev.de/downloads-2316#StringLatin it is said,
> that the spec is a strict subset of unicode (E.1.6), and it is also
> mentioned in E.1.4, that in UTF-8 all unicode characters can be
> encoded. Therefore UTF-8 can be used to encode all DIN SPEC 91379
> characters.

So the remaining question is whether DIN SPEC 91379 requires an
implementation to support character U+0000.  If it does, then PostgreSQL
is not conformant, because that character is the only one in Unicode
that we don't support.  If U+0000 is not required, then PostgreSQL is
okay.
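
As a sketch of what this restriction means for client code: an application could check strings for U+0000 before sending them to the server. The helper names below are hypothetical, and whether to reject or strip such input is an application decision, not something this thread prescribes.

```python
def find_nul_positions(text):
    """Return the indexes at which U+0000 occurs in `text`."""
    return [i for i, ch in enumerate(text) if ch == "\x00"]


def strip_nul(text):
    """Return `text` with every U+0000 removed."""
    return text.replace("\x00", "")


sample = "Lech\x00ner"
positions = find_nul_positions(sample)
cleaned = strip_nul(sample)
```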

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: support for DIN SPEC 91379 encoding

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2022-Mar-27, Ralf Schuchardt wrote:
>> linked here https://www.xoev.de/downloads-2316#StringLatin it is said,
>> that the spec is a strict subset of unicode (E.1.6), and it is also
>> mentioned in E.1.4, that in UTF-8 all unicode characters can be
>> encoded. Therefore UTF-8 can be used to encode all DIN SPEC 91379
>> characters.

> So the remaining question is whether DIN SPEC 91379 requires an
> implementation to support character U+0000.  If it does, then PostgreSQL
> is not conformant, because that character is the only one in Unicode
> that we don't support.  If U+0000 is not required, then PostgreSQL is
> okay.

Hmm ... UTF-8 as defined in RFC 3629/STD 63 [1] does not allow "all unicode
characters to be encoded".  It disallows the surrogate code points
(U+D800 to U+DFFF) and code points above U+10FFFF.  We follow that spec, so
depending on what DIN SPEC 91379 *actually* says, we might have additional
reasons not to be in compliance.  Unfortunately, I don't read German.
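
The two restrictions can be observed with Python's built-in UTF-8 codec, which also follows RFC 3629. This is a minimal sketch; utf8_encodable is a hypothetical helper written for this illustration, not PostgreSQL's validation code.

```python
def utf8_encodable(code_point):
    """True if `code_point` may be encoded in UTF-8 under RFC 3629."""
    if not 0 <= code_point <= 0x10FFFF:
        return False  # outside the Unicode codespace
    if 0xD800 <= code_point <= 0xDFFF:
        return False  # surrogate code point
    return True


# Python's codec refuses to encode a lone surrogate ...
try:
    "\ud800".encode("utf-8")
    surrogate_rejected = False
except UnicodeEncodeError:
    surrogate_rejected = True

# ... and refuses to decode the byte pattern a surrogate would produce.
try:
    b"\xed\xa0\x80".decode("utf-8")
    surrogate_bytes_rejected = False
except UnicodeDecodeError:
    surrogate_bytes_rejected = True

# Code points above U+10FFFF cannot even be constructed.
try:
    chr(0x110000)
    beyond_range_rejected = False
except ValueError:
    beyond_range_rejected = True
```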

            regards, tom lane

[1] http://www.faqs.org/rfcs/rfc3629.html



Re: support for DIN SPEC 91379 encoding

From
"Bzm@g"
Date:
U+0000 is not part of DIN SPEC 91379.

--
Boris






Re: support for DIN SPEC 91379 encoding

From
"Peter J. Holzer"
Date:
On 2022-03-27 14:06:25 -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > On 2022-Mar-27, Ralf Schuchardt wrote:
> >> linked here https://www.xoev.de/downloads-2316#StringLatin it is said,
> >> that the spec is a strict subset of unicode (E.1.6), and it is also
> >> mentioned in E.1.4, that in UTF-8 all unicode characters can be
> >> encoded. Therefore UTF-8 can be used to encode all DIN SPEC 91379
> >> characters.
>
> > So the remaining question is whether DIN SPEC 91379 requires an
> > implementation to support character U+0000.  If it does, then PostgreSQL
> > is not conformant, because that character is the only one in Unicode
> > that we don't support.  If U+0000 is not required, then PostgreSQL is
> > okay.
>
> Hmm ... UTF8 as defined in RFC3629/STD63 [1] does not allow "all unicode
> characters to be encoded".  It disallows surrogate pairs (U+D800--U+DFFF)
> and code points above U+10FFFF.

From section 2.4 Code Points and Characters of the Unicode Standard,
Version 14.0 - Core Specification:

| In the Unicode Standard, the codespace consists of the integers from 0
| to 10FFFF₁₆, comprising 1,114,112 code points available for
| assigning the repertoire of abstract characters.

So there are no characters above U+10FFFF.

Also,

| Not all assigned code points represent abstract characters; only
| Graphic, Format, Control and Private-use do. Surrogates and
| Noncharacters are assigned code points but are not assigned to
| abstract characters.

So Surrogates aren't characters either.

UTF-8 can indeed be used to encode "all unicode characters".
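
The numbers quoted above can be checked with a short Python sketch. It assumes the surrogate range U+D800 to U+DFFF (2,048 code points) mentioned earlier in the thread; count_utf8_round_trips is a hypothetical helper that walks the whole codespace and counts the code points that survive a UTF-8 round trip.

```python
CODESPACE_SIZE = 0x10FFFF + 1       # integers 0 .. 10FFFF (hex)
SURROGATES = range(0xD800, 0xE000)  # U+D800 .. U+DFFF


def count_utf8_round_trips():
    """Count the non-surrogate code points that encode to UTF-8 and
    decode back to the same character."""
    count = 0
    for cp in range(CODESPACE_SIZE):
        if cp in SURROGATES:
            continue  # surrogates are not characters; skip them
        ch = chr(cp)
        if ch.encode("utf-8").decode("utf-8") == ch:
            count += 1
    return count


non_surrogate_code_points = CODESPACE_SIZE - len(SURROGATES)
```

If every non-surrogate code point round-trips, count_utf8_round_trips() equals non_surrogate_code_points.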

> We follow that spec, so depending on what DIN 91379 *actually* says,
> we might have additional reasons not to be in compliance.  I don't
> read German unfortunately.

It defines a minimal character set that IT systems which process personal
and company names in the EU must accept. Basically Latin, Greek and
Cyrillic letters, digits, and some symbols and punctuation.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Re: support for DIN SPEC 91379 encoding

From
Alvaro Herrera
Date:
On 2022-Mar-28, Peter J. Holzer wrote:

> On 2022-03-27 14:06:25 -0400, Tom Lane wrote:

> > We follow that spec, so depending on what DIN 91379 *actually* says,
> > we might have additional reasons not to be in compliance.  I don't
> > read German unfortunately.
> 
> It defines a minimal character set that IT systems which process personal
> and company names in the EU must accept. Basically Latin, Greek and
> Cyrillic letters, digits, and some symbols and punctuation.

Yeah, I had a look at the list of allowed characters and it's a
reasonably simple set.  The most complex you can find is stuff like

LATIN CAPITAL LETTER R WITH COMBINING RING BELOW AND COMBINING MACRON
LATIN CAPITAL LETTER K WITH COMBINING DOUBLE MACRON BELOW AND LATIN SMALL LETTER H
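
Entries like these are sequences of several code points rather than single characters. A minimal Python sketch, assuming each entry is the base letter followed by the combining marks in the order in which they are named (the exact code point order is an assumption here, not taken from the spec):

```python
import unicodedata

# Assumed composition: base letter followed by the named combining marks.
r_seq = (
    "R"
    + unicodedata.lookup("COMBINING RING BELOW")
    + unicodedata.lookup("COMBINING MACRON")
)
kh_seq = "K" + unicodedata.lookup("COMBINING DOUBLE MACRON BELOW") + "h"


def describe(seq):
    """Return the Unicode name of every code point in `seq`."""
    return [unicodedata.name(ch) for ch in seq]


r_names = describe(r_seq)
r_utf8_length = len(r_seq.encode("utf-8"))
kh_utf8_length = len(kh_seq.encode("utf-8"))
```

Each sequence is three code points long and encodes in UTF-8 like any other text, so a length limit counted in characters or bytes has to leave room for such sequences.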

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Tiene valor aquel que admite que es un cobarde" (Fernandel)