IANA timezone abbreviations versus timezone_abbreviations - Mailing list pgsql-hackers

From Tom Lane
Subject IANA timezone abbreviations versus timezone_abbreviations
Date
Msg-id 957280.1734046384@sss.pgh.pa.us
Whole thread Raw
Responses Re: IANA timezone abbreviations versus timezone_abbreviations
List pgsql-hackers
Although we've never documented this, it's been true for ages
that timestamptz_out shows timezone abbreviations that are taken
from the IANA tzdb data (in datestyles that use non-numeric
timezone fields, which is all but ISO).  Meanwhile, timestamptz_in
recognizes timezone abbreviations if they match an entry in the
timezone_abbreviations file.  Those two sources of truth are not
entirely consistent, leading to fun results like these:

regression=# set datestyle = postgres;
SET
regression=# set timezone = 'America/Montevideo';
SET
regression=# select '1900-01-01 00:00'::timestamptz;
         timestamptz
------------------------------
 Mon Jan 01 00:00:00 1900 LMT
(1 row)

regression=# select '1900-01-01 00:00'::timestamptz::text::timestamptz;
ERROR:  invalid input syntax for type timestamp with time zone: "Mon Jan 01 00:00:00 1900 LMT"

(because LMT for "local mean time" is not in timezone_abbreviations)

regression=# select '1912-01-01 00:00'::timestamptz;
         timestamptz
------------------------------
 Mon Jan 01 00:00:00 1912 MMT
(1 row)

regression=# select '1912-01-01 00:00'::timestamptz::text::timestamptz;
         timestamptz
------------------------------
 Sun Dec 31 13:45:09 1911 MMT
(1 row)

(because this zone uses MMT for "Montevideo Mean Time" while
timezone_abbreviations thinks it means "Myanmar Time").

You can get unfortunate results even for current timestamps,
because we've not hesitated to put some North-American-centric
interpretations into the default abbreviations list:

regression=# set timezone = 'Asia/Shanghai';
SET
regression=# select '2024-12-13 00:00'::timestamptz;
         timestamptz
------------------------------
 Fri Dec 13 00:00:00 2024 CST
(1 row)

regression=# select '2024-12-13 00:00'::timestamptz::text::timestamptz;
         timestamptz
------------------------------
 Fri Dec 13 14:00:00 2024 CST

(because this IANA zone uses CST for "China Standard Time"
while timezone_abbreviations thinks it means (USA) "Central
Standard Time").

This mess was brought up in pgsql-bugs [1], but the solution
I propose here is invasive enough that I think it needs
discussion on -hackers.

What I think we should do about this is to teach timestamp
input to look into the current IANA time zone to see if it
knows the given abbreviation, and if so use that meaning
regardless of what timezone_abbreviations might say.  This
isn't particularly hard, and it doesn't appear to cost
anything speed-wise, but is there anybody out there who
is relying on the current behavior?

I can imagine that somebody might be using an interpretation
that is contrary to IANA's ideas; but it seems fairly unlikely
with current IANA data, because they largely got rid of the
made-up abbreviations their data used to be full of.  Anyplace
where we find a non-numeric abbreviation in the IANA data is
probably someplace where that abbreviation is widely current,
and people wouldn't expect it to mean something different.

On the positive side, this gives us a far better story for
abbreviations that conflict in different parts of the world.
timestamptz_in will now automatically do the right thing
given a correct timezone setting, without having to manually
adjust the abbreviation list.  So between that and getting
rid of the round-trip hazards seen above, I think there is
sufficient reason to do this.

The only other way I can envision to remove the round-trip hazard
is to stop using alphabetic abbreviations at all in timestamp
output, and use numeric GMT offsets regardless of datestyle.
I doubt that would make many people happy.  It would certainly
break a bunch of our own regression tests, and I expect it would
break other people's applications too.

(To be clear, I'm only proposing this for v18 not for back-patch.
While it's certainly fixing live bugs, there have not been that
many complaints, and the behavioral change is surely more than
we want for a back branch.)

Draft patches attached.  Any thoughts?

            regards, tom lane

[1] https://www.postgresql.org/message-id/flat/CAJ7c6TOATjJqvhnYsui0%3DCO5XFMF4dvTGH%2BskzB--jNhqSQu5g%40mail.gmail.com

From cd36b9f4916abdcc2594ed7529941c787ca10fa4 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Thu, 12 Dec 2024 17:39:44 -0500
Subject: [PATCH v1 1/2] Seek zone abbreviations in the IANA data before
 timezone_abbreviations.

If a time zone abbreviation used in datetime input is defined in
the currently active timezone, use that definition in preference
to looking in the timezone_abbreviations list.  That allows us to
correctly handle abbreviations that have different meanings in
different timezones.  Also, it eliminates an inconsistency between
datetime input and datetime output: the non-ISO datestyles for
timestamptz have always printed abbreviations taken from the IANA
data not from timezone_abbreviations.  Before this fix, it was
possible to demonstrate cases where casting a timestamp to text
and back fails or changes the value significantly because of that
inconsistency.

While this change removes the ability to override the IANA data about
an abbreviation known in the current zone, it's not clear that there's
any real use-case for doing so.  But it is clear that this makes life
a lot easier for dealing with abbreviations that have conflicts.

There are a couple of loose ends still to deal with:

* As this patch stands, it causes a noticeable degradation of the
runtime of timestamptz_in (about 20% in a microbenchmark of just
that function).  This is from DecodeTimezoneAbbrev not caching
the results of its lookup in the new path.  I split out the
improvement of that part for a follow-up patch.

* The pg_timezone_abbrevs view shows only abbreviations from
the timezone_abbreviations list.  That should probably be
adjusted to account for abbreviations taken from the timezone.

Per report from Aleksander Alekseev and additional investigation.

Discussion: https://postgr.es/m/CAJ7c6TOATjJqvhnYsui0=CO5XFMF4dvTGH+skzB--jNhqSQu5g@mail.gmail.com
---
 doc/src/sgml/datatype.sgml                |  4 +
 doc/src/sgml/datetime.sgml                | 14 +++-
 src/backend/utils/adt/datetime.c          | 89 +++++++++++++++++++++--
 src/include/pgtime.h                      |  5 ++
 src/test/regress/expected/horology.out    |  6 ++
 src/test/regress/expected/timestamptz.out | 59 +++++++++++++++
 src/test/regress/sql/horology.sql         |  1 +
 src/test/regress/sql/timestamptz.sql      | 17 +++++
 src/timezone/localtime.c                  | 76 +++++++++++++++++++
 9 files changed, 265 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 3e6751d64c..1d9127e94e 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -2534,6 +2534,10 @@ TIMESTAMP WITH TIME ZONE '2004-10-19 10:23:54+02'
      abbreviation if one is in common use in the current zone.  Otherwise
      it appears as a signed numeric offset in ISO 8601 basic format
      (<replaceable>hh</replaceable> or <replaceable>hhmm</replaceable>).
+     The alphabetic abbreviations shown in these styles are taken from the
+     IANA time zone database entry currently selected by the
+     <xref linkend="guc-timezone"/> run-time parameter; they are not
+     affected by the <xref linkend="guc-timezone-abbreviations"/> setting.
     </para>

     <para>
diff --git a/doc/src/sgml/datetime.sgml b/doc/src/sgml/datetime.sgml
index e7035c7806..eb8760c1f0 100644
--- a/doc/src/sgml/datetime.sgml
+++ b/doc/src/sgml/datetime.sgml
@@ -424,7 +424,7 @@
    <para>
     Since timezone abbreviations are not well standardized,
     <productname>PostgreSQL</productname> provides a means to customize
-    the set of abbreviations accepted by the server.  The
+    the set of abbreviations accepted in datetime input.  The
     <xref linkend="guc-timezone-abbreviations"/> run-time parameter
     determines the active set of abbreviations.  While this parameter
     can be altered by any database user, the possible values for it
@@ -444,6 +444,18 @@
     backup files and other extraneous files.)
    </para>

+   <para>
+    Before consulting the <varname>timezone_abbreviations</varname> file,
+    <productname>PostgreSQL</productname> checks to see whether an
+    abbreviation used in datetime input is defined in the IANA time zone
+    database entry currently selected by the
+    <xref linkend="guc-timezone"/> run-time parameter.  If so the time
+    zone's meaning is used, for consistency with datetime output.  The
+    <varname>timezone_abbreviations</varname> file is mainly useful for
+    allowing datetime input to recognize abbreviations for time zones
+    other than the active zone.
+   </para>
+
    <para>
     A timezone abbreviation file can contain blank lines and comments
     beginning with <literal>#</literal>.  Non-comment lines must have one of
diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c
index 0b19cddf54..e6b3ac134d 100644
--- a/src/backend/utils/adt/datetime.c
+++ b/src/backend/utils/adt/datetime.c
@@ -1845,6 +1845,40 @@ DetermineTimeZoneAbbrevOffsetInternal(pg_time_t t, const char *abbr, pg_tz *tzp,
 }


+/* TimeZoneAbbrevIsKnown()
+ *
+ * Detect whether the given string is a time zone abbreviation that's known
+ * in the specified TZDB timezone, and if so whether it's fixed or varying
+ * meaning.  The match is not case-sensitive.
+ */
+static bool
+TimeZoneAbbrevIsKnown(const char *abbr, pg_tz *tzp,
+                      bool *isfixed, int *offset, int *isdst)
+{
+    char        upabbr[TZ_STRLEN_MAX + 1];
+    unsigned char *p;
+    long int    gmtoff;
+
+    /* We need to force the abbrev to upper case */
+    strlcpy(upabbr, abbr, sizeof(upabbr));
+    for (p = (unsigned char *) upabbr; *p; p++)
+        *p = pg_toupper(*p);
+
+    /* Look up the abbrev's meaning in this zone */
+    if (pg_timezone_abbrev_is_known(upabbr,
+                                    isfixed,
+                                    &gmtoff,
+                                    isdst,
+                                    tzp))
+    {
+        /* Change sign to agree with DetermineTimeZoneOffset() */
+        *offset = (int) -gmtoff;
+        return true;
+    }
+    return false;
+}
+
+
 /* DecodeTimeOnly()
  * Interpret parsed string as time fields only.
  * Returns 0 if successful, DTERR code if bogus input detected.
@@ -3092,8 +3126,28 @@ DecodeTimezoneAbbrev(int field, const char *lowtoken,
                      int *ftype, int *offset, pg_tz **tz,
                      DateTimeErrorExtra *extra)
 {
+    bool        isfixed;
+    int            isdst;
     const datetkn *tp;

+    /*
+     * See if the current session_timezone recognizes it.  Checking this
+     * before zoneabbrevtbl allows us to correctly handle abbreviations whose
+     * meaning varies across zones, such as "LMT".  (Caching this lookup is
+     * left for later.)
+     */
+    if (session_timezone &&
+        TimeZoneAbbrevIsKnown(lowtoken, session_timezone,
+                              &isfixed, offset, &isdst))
+    {
+        *ftype = (isfixed ? (isdst ? DTZ : TZ) : DYNTZ);
+        *tz = (isfixed ? NULL : session_timezone);
+        /* flip sign to agree with the convention used in zoneabbrevtbl */
+        *offset = -(*offset);
+        return 0;
+    }
+
+    /* Nope, so look in zoneabbrevtbl */
     tp = abbrevcache[field];
     /* use strncmp so that we match truncated tokens */
     if (tp == NULL || strncmp(lowtoken, tp->token, TOKMAXLEN) != 0)
@@ -3109,6 +3163,7 @@ DecodeTimezoneAbbrev(int field, const char *lowtoken,
         *ftype = UNKNOWN_FIELD;
         *offset = 0;
         *tz = NULL;
+        /* failure results are not cached */
     }
     else
     {
@@ -3278,9 +3333,6 @@ DecodeTimezoneAbbrevPrefix(const char *str, int *offset, pg_tz **tz)
     *offset = 0;                /* avoid uninitialized vars on failure */
     *tz = NULL;

-    if (!zoneabbrevtbl)
-        return -1;                /* no abbrevs known, so fail immediately */
-
     /* Downcase as much of the string as we could need */
     for (len = 0; len < TOKMAXLEN; len++)
     {
@@ -3299,9 +3351,34 @@ DecodeTimezoneAbbrevPrefix(const char *str, int *offset, pg_tz **tz)
      */
     while (len > 0)
     {
-        const datetkn *tp = datebsearch(lowtoken, zoneabbrevtbl->abbrevs,
-                                        zoneabbrevtbl->numabbrevs);
+        bool        isfixed;
+        int            isdst;
+        const datetkn *tp;
+
+        /* See if the current session_timezone recognizes it. */
+        if (session_timezone &&
+            TimeZoneAbbrevIsKnown(lowtoken, session_timezone,
+                                  &isfixed, offset, &isdst))
+        {
+            if (isfixed)
+            {
+                /* flip sign to agree with the convention in zoneabbrevtbl */
+                *offset = -(*offset);
+            }
+            else
+            {
+                /* Caller must resolve the abbrev's current meaning */
+                *tz = session_timezone;
+            }
+            return len;
+        }

+        /* Known in zoneabbrevtbl? */
+        if (zoneabbrevtbl)
+            tp = datebsearch(lowtoken, zoneabbrevtbl->abbrevs,
+                             zoneabbrevtbl->numabbrevs);
+        else
+            tp = NULL;
         if (tp != NULL)
         {
             if (tp->type == DYNTZ)
@@ -3324,6 +3401,8 @@ DecodeTimezoneAbbrevPrefix(const char *str, int *offset, pg_tz **tz)
                 return len;
             }
         }
+
+        /* Nope, try the next shorter string. */
         lowtoken[--len] = '\0';
     }

diff --git a/src/include/pgtime.h b/src/include/pgtime.h
index 2742086711..a20b5019d8 100644
--- a/src/include/pgtime.h
+++ b/src/include/pgtime.h
@@ -69,6 +69,11 @@ extern bool pg_interpret_timezone_abbrev(const char *abbrev,
                                          long int *gmtoff,
                                          int *isdst,
                                          const pg_tz *tz);
+extern bool pg_timezone_abbrev_is_known(const char *abbrev,
+                                        bool *isfixed,
+                                        long int *gmtoff,
+                                        int *isdst,
+                                        const pg_tz *tz);
 extern bool pg_get_timezone_offset(const pg_tz *tz, long int *gmtoff);
 extern const char *pg_get_timezone_name(pg_tz *tz);
 extern bool pg_tz_acceptable(pg_tz *tz);
diff --git a/src/test/regress/expected/horology.out b/src/test/regress/expected/horology.out
index cb28dfbaee..b90bfcd794 100644
--- a/src/test/regress/expected/horology.out
+++ b/src/test/regress/expected/horology.out
@@ -3332,6 +3332,12 @@ SELECT to_timestamp('2011-12-18 11:38 MSK', 'YYYY-MM-DD HH12:MI TZ');  -- dyntz
  Sat Dec 17 23:38:00 2011 PST
 (1 row)

+SELECT to_timestamp('2011-12-18 00:00 LMT', 'YYYY-MM-DD HH24:MI TZ');  -- dyntz
+         to_timestamp
+------------------------------
+ Sat Dec 17 23:52:58 2011 PST
+(1 row)
+
 SELECT to_timestamp('2011-12-18 11:38ESTFOO24', 'YYYY-MM-DD HH12:MITZFOOSS');
          to_timestamp
 ------------------------------
diff --git a/src/test/regress/expected/timestamptz.out b/src/test/regress/expected/timestamptz.out
index b437613ac8..be01cfc76f 100644
--- a/src/test/regress/expected/timestamptz.out
+++ b/src/test/regress/expected/timestamptz.out
@@ -176,6 +176,65 @@ SELECT '205000-01-10 17:32:01 Europe/Helsinki'::timestamptz; -- non-DST
  Fri Jan 10 07:32:01 205000 PST
 (1 row)

+-- Recognize "LMT" as whatever it means in the current zone
+SELECT 'Jan 01 00:00:00 1000 LMT'::timestamptz;
+         timestamptz
+------------------------------
+ Wed Jan 01 00:00:00 1000 LMT
+(1 row)
+
+SELECT 'Jan 01 00:00:00 2024 LMT'::timestamptz;
+         timestamptz
+------------------------------
+ Sun Dec 31 23:52:58 2023 PST
+(1 row)
+
+SET timezone = 'Europe/London';
+SELECT 'Jan 01 00:00:00 1000 LMT'::timestamptz;
+         timestamptz
+------------------------------
+ Wed Jan 01 00:00:00 1000 LMT
+(1 row)
+
+SELECT 'Jan 01 00:00:00 2024 LMT'::timestamptz;
+         timestamptz
+------------------------------
+ Mon Jan 01 00:01:15 2024 GMT
+(1 row)
+
+-- which might be nothing
+SET timezone = 'UTC';
+SELECT 'Jan 01 00:00:00 2024 LMT'::timestamptz;  -- fail
+ERROR:  invalid input syntax for type timestamp with time zone: "Jan 01 00:00:00 2024 LMT"
+LINE 1: SELECT 'Jan 01 00:00:00 2024 LMT'::timestamptz;
+               ^
+-- Another example of an abbrev that varies across zones
+SELECT '1912-01-01 00:00 MMT'::timestamptz;  -- from timezone_abbreviations
+         timestamptz
+------------------------------
+ Sun Dec 31 17:30:00 1911 UTC
+(1 row)
+
+SET timezone = 'America/Montevideo';
+SELECT '1912-01-01 00:00'::timestamptz;
+         timestamptz
+------------------------------
+ Mon Jan 01 00:00:00 1912 MMT
+(1 row)
+
+SELECT '1912-01-01 00:00 MMT'::timestamptz;
+         timestamptz
+------------------------------
+ Mon Jan 01 00:00:00 1912 MMT
+(1 row)
+
+SELECT '1912-01-01 00:00 MMT'::timestamptz AT TIME ZONE 'UTC';
+         timezone
+--------------------------
+ Mon Jan 01 03:44:51 1912
+(1 row)
+
+RESET timezone;
 -- Test non-error-throwing API
 SELECT pg_input_is_valid('now', 'timestamptz');
  pg_input_is_valid
diff --git a/src/test/regress/sql/horology.sql b/src/test/regress/sql/horology.sql
index 4aa88b4ba9..1310b43277 100644
--- a/src/test/regress/sql/horology.sql
+++ b/src/test/regress/sql/horology.sql
@@ -538,6 +538,7 @@ SELECT to_timestamp('2011-12-18 11:38 EST', 'YYYY-MM-DD HH12:MI TZ');
 SELECT to_timestamp('2011-12-18 11:38 -05', 'YYYY-MM-DD HH12:MI TZ');
 SELECT to_timestamp('2011-12-18 11:38 +01:30', 'YYYY-MM-DD HH12:MI TZ');
 SELECT to_timestamp('2011-12-18 11:38 MSK', 'YYYY-MM-DD HH12:MI TZ');  -- dyntz
+SELECT to_timestamp('2011-12-18 00:00 LMT', 'YYYY-MM-DD HH24:MI TZ');  -- dyntz
 SELECT to_timestamp('2011-12-18 11:38ESTFOO24', 'YYYY-MM-DD HH12:MITZFOOSS');
 SELECT to_timestamp('2011-12-18 11:38-05FOO24', 'YYYY-MM-DD HH12:MITZFOOSS');
 SELECT to_timestamp('2011-12-18 11:38 JUNK', 'YYYY-MM-DD HH12:MI TZ');  -- error
diff --git a/src/test/regress/sql/timestamptz.sql b/src/test/regress/sql/timestamptz.sql
index 6b91e7eddc..7f38a23d2d 100644
--- a/src/test/regress/sql/timestamptz.sql
+++ b/src/test/regress/sql/timestamptz.sql
@@ -109,6 +109,23 @@ SELECT '20500110 173201 Europe/Helsinki'::timestamptz; -- non-DST
 SELECT '205000-07-10 17:32:01 Europe/Helsinki'::timestamptz; -- DST
 SELECT '205000-01-10 17:32:01 Europe/Helsinki'::timestamptz; -- non-DST

+-- Recognize "LMT" as whatever it means in the current zone
+SELECT 'Jan 01 00:00:00 1000 LMT'::timestamptz;
+SELECT 'Jan 01 00:00:00 2024 LMT'::timestamptz;
+SET timezone = 'Europe/London';
+SELECT 'Jan 01 00:00:00 1000 LMT'::timestamptz;
+SELECT 'Jan 01 00:00:00 2024 LMT'::timestamptz;
+-- which might be nothing
+SET timezone = 'UTC';
+SELECT 'Jan 01 00:00:00 2024 LMT'::timestamptz;  -- fail
+-- Another example of an abbrev that varies across zones
+SELECT '1912-01-01 00:00 MMT'::timestamptz;  -- from timezone_abbreviations
+SET timezone = 'America/Montevideo';
+SELECT '1912-01-01 00:00'::timestamptz;
+SELECT '1912-01-01 00:00 MMT'::timestamptz;
+SELECT '1912-01-01 00:00 MMT'::timestamptz AT TIME ZONE 'UTC';
+RESET timezone;
+
 -- Test non-error-throwing API
 SELECT pg_input_is_valid('now', 'timestamptz');
 SELECT pg_input_is_valid('garbage', 'timestamptz');
diff --git a/src/timezone/localtime.c b/src/timezone/localtime.c
index 0bc160ea7d..65511ae8be 100644
--- a/src/timezone/localtime.c
+++ b/src/timezone/localtime.c
@@ -1843,6 +1843,82 @@ pg_interpret_timezone_abbrev(const char *abbrev,
     return false;                /* hm, not actually used in any interval? */
 }

+/*
+ * Detect whether a timezone abbreviation is defined within the given zone.
+ *
+ * This is similar to pg_interpret_timezone_abbrev() but is not concerned
+ * with a specific point in time.  We want to know if the abbreviation is
+ * known at all, and if so whether it has one meaning or several.
+ *
+ * Returns true if the abbreviation is known, false if not.
+ * If the abbreviation is known and has a single meaning (only one value
+ * of gmtoff/isdst), sets *isfixed = true and sets *gmtoff and *isdst.
+ * If there are multiple meanings, sets *isfixed = false.
+ *
+ * Note: abbrev is matched case-sensitively; it should be all-upper-case.
+ */
+bool
+pg_timezone_abbrev_is_known(const char *abbrev,
+                            bool *isfixed,
+                            long int *gmtoff,
+                            int *isdst,
+                            const pg_tz *tz)
+{
+    bool        result = false;
+    const struct state *sp = &tz->state;
+    const char *abbrs;
+    int            abbrind;
+
+    /*
+     * Locate the abbreviation in the zone's abbreviation list.  We assume
+     * there are not duplicates in the list.
+     */
+    abbrs = sp->chars;
+    abbrind = 0;
+    while (abbrind < sp->charcnt)
+    {
+        if (strcmp(abbrev, abbrs + abbrind) == 0)
+            break;
+        while (abbrs[abbrind] != '\0')
+            abbrind++;
+        abbrind++;
+    }
+    if (abbrind >= sp->charcnt)
+        return false;            /* definitely not there */
+
+    /*
+     * Scan the ttinfo array to find uses of the abbreviation.
+     */
+    for (int i = 0; i < sp->typecnt; i++)
+    {
+        const struct ttinfo *ttisp = &sp->ttis[i];
+
+        if (ttisp->tt_desigidx == abbrind)
+        {
+            if (!result)
+            {
+                /* First usage */
+                *isfixed = true;    /* for the moment */
+                *gmtoff = ttisp->tt_utoff;
+                *isdst = ttisp->tt_isdst;
+                result = true;
+            }
+            else
+            {
+                /* Second or later usage, does it match? */
+                if (*gmtoff != ttisp->tt_utoff ||
+                    *isdst != ttisp->tt_isdst)
+                {
+                    *isfixed = false;
+                    break;        /* no point in looking further */
+                }
+            }
+        }
+    }
+
+    return result;
+}
+
 /*
  * If the given timezone uses only one GMT offset, store that offset
  * into *gmtoff and return true, else return false.
--
2.43.5

From 3d50373d5332ae89b6321886ddf2cf74db6e72cd Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Thu, 12 Dec 2024 17:54:15 -0500
Subject: [PATCH v1 2/2] Improve DecodeTimezoneAbbrev's caching logic.

The previous implementation could only cope with caching a
result found in the zoneabbrevtbl.  It is worth expending
a little more space to be able to cache results obtained from
the IANA timezone data, especially since that's likely to be
the majority use-case going forward.

To do this, we have to reset the cache after a change in
the timezone GUC not only the timezone_abbrev GUC, but that's
not hard.

In my testing, this puts the speed of repeated timestamptz_in
calls back on par with what it was before the previous patch.

Per report from Aleksander Alekseev and additional investigation.

Discussion: https://postgr.es/m/CAJ7c6TOATjJqvhnYsui0=CO5XFMF4dvTGH+skzB--jNhqSQu5g@mail.gmail.com
---
 src/backend/commands/variable.c  |  2 +
 src/backend/utils/adt/datetime.c | 69 ++++++++++++++++++++++++--------
 src/include/utils/datetime.h     |  2 +
 src/tools/pgindent/typedefs.list |  1 +
 4 files changed, 58 insertions(+), 16 deletions(-)

diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index d2db0c45b2..794d2a0fce 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -381,6 +381,8 @@ void
 assign_timezone(const char *newval, void *extra)
 {
     session_timezone = *((pg_tz **) extra);
+    /* datetime.c's cache of timezone abbrevs may now be obsolete */
+    ClearTimeZoneAbbrevCache();
 }

 /*
diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c
index e6b3ac134d..4e6515865b 100644
--- a/src/backend/utils/adt/datetime.c
+++ b/src/backend/utils/adt/datetime.c
@@ -259,7 +259,17 @@ static const datetkn *datecache[MAXDATEFIELDS] = {NULL};

 static const datetkn *deltacache[MAXDATEFIELDS] = {NULL};

-static const datetkn *abbrevcache[MAXDATEFIELDS] = {NULL};
+/* Cache for results of timezone abbreviation lookups */
+
+typedef struct TzAbbrevCache
+{
+    char        abbrev[TOKMAXLEN + 1];    /* always NUL-terminated */
+    char        ftype;            /* TZ, DTZ, or DYNTZ */
+    int            offset;            /* GMT offset, if fixed-offset */
+    pg_tz       *tz;                /* relevant zone, if variable-offset */
+} TzAbbrevCache;
+
+static TzAbbrevCache tzabbrevcache[MAXDATEFIELDS];


 /*
@@ -3126,15 +3136,28 @@ DecodeTimezoneAbbrev(int field, const char *lowtoken,
                      int *ftype, int *offset, pg_tz **tz,
                      DateTimeErrorExtra *extra)
 {
+    TzAbbrevCache *tzc = &tzabbrevcache[field];
     bool        isfixed;
     int            isdst;
     const datetkn *tp;

+    /*
+     * Do we have a cached result?  Use strncmp so that we match truncated
+     * names, although we shouldn't really see that happen with normal
+     * abbreviations.
+     */
+    if (strncmp(lowtoken, tzc->abbrev, TOKMAXLEN) == 0)
+    {
+        *ftype = tzc->ftype;
+        *offset = tzc->offset;
+        *tz = tzc->tz;
+        return 0;
+    }
+
     /*
      * See if the current session_timezone recognizes it.  Checking this
      * before zoneabbrevtbl allows us to correctly handle abbreviations whose
-     * meaning varies across zones, such as "LMT".  (Caching this lookup is
-     * left for later.)
+     * meaning varies across zones, such as "LMT".
      */
     if (session_timezone &&
         TimeZoneAbbrevIsKnown(lowtoken, session_timezone,
@@ -3144,20 +3167,20 @@ DecodeTimezoneAbbrev(int field, const char *lowtoken,
         *tz = (isfixed ? NULL : session_timezone);
         /* flip sign to agree with the convention used in zoneabbrevtbl */
         *offset = -(*offset);
+        /* cache result; use strlcpy to truncate name if necessary */
+        strlcpy(tzc->abbrev, lowtoken, TOKMAXLEN + 1);
+        tzc->ftype = *ftype;
+        tzc->offset = *offset;
+        tzc->tz = *tz;
         return 0;
     }

     /* Nope, so look in zoneabbrevtbl */
-    tp = abbrevcache[field];
-    /* use strncmp so that we match truncated tokens */
-    if (tp == NULL || strncmp(lowtoken, tp->token, TOKMAXLEN) != 0)
-    {
-        if (zoneabbrevtbl)
-            tp = datebsearch(lowtoken, zoneabbrevtbl->abbrevs,
-                             zoneabbrevtbl->numabbrevs);
-        else
-            tp = NULL;
-    }
+    if (zoneabbrevtbl)
+        tp = datebsearch(lowtoken, zoneabbrevtbl->abbrevs,
+                         zoneabbrevtbl->numabbrevs);
+    else
+        tp = NULL;
     if (tp == NULL)
     {
         *ftype = UNKNOWN_FIELD;
@@ -3167,7 +3190,6 @@ DecodeTimezoneAbbrev(int field, const char *lowtoken,
     }
     else
     {
-        abbrevcache[field] = tp;
         *ftype = tp->type;
         if (tp->type == DYNTZ)
         {
@@ -3181,11 +3203,26 @@ DecodeTimezoneAbbrev(int field, const char *lowtoken,
             *offset = tp->value;
             *tz = NULL;
         }
+
+        /* cache result; use strlcpy to truncate name if necessary */
+        strlcpy(tzc->abbrev, lowtoken, TOKMAXLEN + 1);
+        tzc->ftype = *ftype;
+        tzc->offset = *offset;
+        tzc->tz = *tz;
     }

     return 0;
 }

+/*
+ * Reset tzabbrevcache after a change in session_timezone.
+ */
+void
+ClearTimeZoneAbbrevCache(void)
+{
+    memset(tzabbrevcache, 0, sizeof(tzabbrevcache));
+}
+

 /* DecodeSpecial()
  * Decode text string using lookup table.
@@ -5036,8 +5073,8 @@ void
 InstallTimeZoneAbbrevs(TimeZoneAbbrevTable *tbl)
 {
     zoneabbrevtbl = tbl;
-    /* reset abbrevcache, which may contain pointers into old table */
-    memset(abbrevcache, 0, sizeof(abbrevcache));
+    /* reset tzabbrevcache, which may contain results from old table */
+    memset(tzabbrevcache, 0, sizeof(tzabbrevcache));
 }

 /*
diff --git a/src/include/utils/datetime.h b/src/include/utils/datetime.h
index e4ac2b8e7f..79382b9ecd 100644
--- a/src/include/utils/datetime.h
+++ b/src/include/utils/datetime.h
@@ -351,6 +351,8 @@ extern pg_tz *DecodeTimezoneNameToTz(const char *tzname);
 extern int    DecodeTimezoneAbbrevPrefix(const char *str,
                                        int *offset, pg_tz **tz);

+extern void ClearTimeZoneAbbrevCache(void);
+
 extern int    j2day(int date);

 extern struct Node *TemporalSimplify(int32 max_precis, struct Node *node);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ce33e55bf1..a84f26218c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3022,6 +3022,7 @@ TypeCat
 TypeFuncClass
 TypeInfo
 TypeName
+TzAbbrevCache
 U32
 U8
 UChar
--
2.43.5


pgsql-hackers by date:

Previous
From: Peter Smith
Date:
Subject: Re: Adding a '--two-phase' option to 'pg_createsubscriber' utility.
Next
From: Peter Smith
Date:
Subject: Re: pg_createsubscriber TAP test wrapping makes command options hard to read.