During FOSDEM I had a chance to join a presentation of the backup procedure that the engineers from GitLab followed to decrease the downtime during major upgrades. I highly recommend the talk, definitely worth watching!
I loved the presentation, especially since a similar procedure is what we recommend our users as well. Unfortunately it’s not for everyone: it’s applicable ONLY when you can stop DDL operations on your database cluster for some time. As you can imagine that’s not a case for every deployment.
This limitation got me into some very engaging discussions during my fav part of any conference, the so called “hallway track”. Networking and discussions after the talks and on the conference corridors are why I like going to such events. I’ve heard some stories of nasty surprises coming out of the unexpected re-indexing after an upgrade.
It made me wonder, how many professionals have moved to PostgreSQL from other areas of expertise and are lacking information about the index rebuilds after upgrades may be necessary. How often are DevOps engineers or SREs, neither fluent in PostgreSQL nor experienced with databases, tasked with maintaining database infrastructure?
In the meantime I figured out that a short coffee time read is what I want to aim at. No super deep dives, rather food for thought and inspiring more of those engaging discussions I like so much about the conferences.
So let’s get to it. This week I want to go through some basic facts before diving into more challenging topics in the coming weeks.
What are collations
In general, a collation defines how values are ordered and compared. In databases, it most commonly applies to text. It’s often not a trivial thing to determine how strings are sorted, whether two values are considered equal, and how things like upper vs lower case or special characters are treated. Just like alphabetical order helps us organize words in everyday life, collations define the rules the database uses when comparing and sorting character data. This allows us to sort structures like strings. Numbers are even simpler.
A good visualization of how significant collations are is when you try to imagine determining whether one item is before another if they come from different alphabets. Where do you position country specific characters in such a sorting order?
Lets look at a set of example data:
Bob
Anna
Zoë
ÁlvaroWhen using an English-like collation the sorting would look like this as Á is treated like A . This is visible in Collation 1:
Anna
Álvaro
Bob
ZoëThough with a collation changed so that Á is sorted separately before A . This is visible in Collation 2:
Álvaro
Anna
Bob
ZoëIt’s clearly visible that collation defines how text is sorted and compared.
What are indexes
If you’re not a database professional, you may not be familiar with what an index is. Think of it as of a structure that speeds up the search. In an old school library you had to check the index of authors to find the book you’ve been looking for. Similarly in databases when knowing what the data is going to be searched for, we introduce indexes.
A common use of indexes is to enforce uniqueness of values. Primary keys and unique constraints automatically create unique indexes to ensure that no duplicate values are inserted. In addition, users can create their own indexes to support specific query patterns. Indexes can be single or multi column, depending on the particular use cases and to help speed up specific queries. Let’s look at an example of physically unsorted data stored in heap
Users (unsorted data)
--------------------
[1] Zoë
[2] Álvaro
[3] Anna
[4] Bobwith an Index on name
Index (B-tree on name)
----------------------
Anna -> [3]
Bob -> [4]
Zoë -> [1]
Álvaro -> [2]The index stores values in sorted order and points to their location in the table. Scanning through the list in order is fine for small number of items, as indexes get larger, there are opportunities to optimize this process. The most commonly used optimisation, very simplified, is to split the list and create a pointer which stores the maximum value in the left half of the list and the minimum value in the right half. As the lists get too large again, they are split again, creating a tree of these pointers.
Why re-indexing is needed
If the rules for comparing the strings are different at query time from what they were when the index was created, the search may fail in various ways. If the scan encounters a value which, by the rules at query time, is higher in the sort order than the value being searched for, it will conclude that the value being searched for is not in the list. Similarly, the search may follow a pointer to the wrong list of values. To make this more concrete, let’s look at a real world example from a glibc change in 2019. Before the change, a list of our values was sorted as follows
aa
a a
a-a
a+aafter the change, the list was sorted as follows:
a a
a+a
a-a
aaIf an index was built under the first set of rules and searched under the second, a search for the value ‘a a’ will fail, because the first item in the index (‘aa’) is higher in the sort order than the value being searched for. Failing to find a value which is in the list can cause issues with application behavior - like missing records or records which appear in some queries (when the index is not used) but not in others (where the index is used). It may allow the insertion of values that should now be considered equal under the new collation rules, effectively violating uniqueness expectations.
Since PostgreSQL 10, the server tracks the collation version used when an index was built and can detect when it no longer matches the system’s current collation version, which is why reindexing may suddenly become mandatory after an upgrade.
However, PostgreSQL server does not analyze your actual data to determine whether the collation change affects your stored values. It only detects that the collation provider version has changed. In some cases, this means reindexing is required even though the effective sort order of your specific data has not changed. Because PostgreSQL cannot reliably determine whether your specific dataset is affected, it treats the situation as potentially unsafe.
When do collations change?

There’s a number of scenarios when collations change:
- OS /
glibcupgrade (Linux) - PostgreSQL relies on the systemglibcprovided collations so if a version changes, the collation rules may change as well. The example above is taken from the upgrade to glibc 2.28 which is one PostgreSQL Community remembers due to the ripple effect it caused. - ICU library upgrade (for ICU collations) - if PostgreSQL deployment uses ICU collations, upgrading ICU library will affect these. While they are not tied to
glibcthese changes also happen. - Major PostgreSQL upgrade (in some cases) - if the new version uses a different collation provider behavior or updated ICU integration
- Database restored on a system with different collation versions - logical dump/restore onto a host with different
glibc/ ICU versions can invalidate indexes.
When re-indexing is necessary
Looking at the above provided list an observation is quite immediate, that not every collation type is affected:
- if the collation is not dependent on
glibc/ ICU then it won’t be affected. As such C/POSIX collations are immune to such issues. - Same truth sticks to the data types. The collation change problem affects only those indexes which are the text or character based. All other datatypes like all types of integers and floating points, date, timestamp, geometric data types or even vector data remain unaffected.
What’s also important is that even if a collation changed it’s not necessary it will affect a given index. Think of it this way, if your database does not use any language specific characters, chances are that collation change will not require re-index. Unfortunately you don’t know that until you look inside your data. In controlled environments, teams may assess the risk before scheduling reindexing, but from PostgreSQL’s perspective the index must be considered potentially inconsistent until rebuilt.
What’s next?
Next week we’ll look at what is the reality of the DBA team regarding upgrades ∎





Discussion
We invite you to our forum for discussion. You are welcome to use the widget below.