PostgreSQL active-active replication, do you really need it?

by Jan Wieremjewicz

Before we start, what is active-active?

Active-active, also referred to as multi-primary, is a setup where multiple database nodes can accept writes at the same time and propagate those changes to the others. In comparison, regular streaming replication in PostgreSQL allows only one node (the primary) to accept writes. All other nodes (replicas) are read-only and follow changes.

In an active-active setup:

  • There is no single point of write.
  • Applications can write to any node.
  • The database needs a way to sort out conflicts when two nodes try to concurrently change the same data.

That last point is the hardest one. PostgreSQL was not designed for concurrent writes from multiple nodes; it’s not a distributed database and does not leverage proprietary dedicated storage capabilities. So, every multi-primary implementation has to solve the issue of conflicting concurrent writes somehow. Some resolve conflicts using timestamps or priorities. Some push conflict resolution to the application. Some avoid it altogether by writing to separate subsets of data.
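
To make the "timestamps or priorities" option concrete, here is a minimal sketch of a last-write-wins rule with a priority tie-breaker. It is a generic illustration, not how pgactive or any specific extension implements it; the `Version` type, the `node_priority` field, and the assumption that commit timestamps are comparable across nodes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """One node's view of a row after a concurrent update (hypothetical type)."""
    value: str
    committed_at: float  # commit time in seconds; assumes clocks are comparable
    node_priority: int   # hypothetical tie-breaker: the higher number wins


def resolve(local: Version, remote: Version) -> Version:
    """Pick a winner: the later timestamp wins, priority breaks exact ties."""
    local_key = (local.committed_at, local.node_priority)
    remote_key = (remote.committed_at, remote.node_priority)
    # When both keys are identical, the local version is kept.
    return remote if remote_key > local_key else local


a = Version("shipped", committed_at=100.0, node_priority=1)
b = Version("cancelled", committed_at=100.5, node_priority=2)
winner = resolve(a, b)
print(winner.value)
```

Note that the losing write is simply discarded. Whether that is acceptable is a business decision, which is part of why this point is the hardest one.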

While simple in concept, implementing an active-active configuration is challenging.

pgactive to the rescue?

Last week, Amazon open-sourced its active-active replication extension, pgactive (https://github.com/aws/pgactive). While the extension has been generally available on AWS RDS since October 2023, there are unfortunately few public stories about it being used in production. To be fair, I was not able to find any 😟

We often see both users and customers come asking for active-active or multi-master. These terms, while different, are so often used as synonyms that we’ve come to expect that. So, though I understand that every multi-master is active-active but not necessarily the other way around, for the sake of clarity, if I use one or the other term throughout this post, they will refer to the same concept.

As it is an open-source extension now, it immediately caught my interest. It seems that it could cover this ask from users I often speak with about their pains and needs. As a product manager, when I hear an ask, I always try to understand the reasons: whether it is a requirement, a need, or actually a solution that addresses one. For multi-master, my strong opinion is that it is a solution.

Key question: do you need it?

I like the opening of the talk Jonathan Katz gave at PGConf Europe 2023 in Prague:

The first thing I always say on the journey to active active is: do you really need it? Because it definitely solves a lot of problems (…) but it’s very hard to manage.

That is exactly the first question I ask when I hear someone asking for active-active. We have seen teams introduce active-active replication for the wrong reasons. Here I have to pause. Yes, as database experts, we have strong opinions about what the right reasons for using multi-master are. It’s not a silver bullet. It’s not “cool infra.” And using it without a good reason tends to hurt for a long, long time.

So, what are the reasons to use active-active? I do not claim to be able to cover all scenarios, but I hope this post raises enough eyebrows and sparks enough discussion to eventually have solid reading material for anyone considering active-active that will help them make an informed decision.

What are “good” reasons?

These are some of the situations where active-active might actually make sense. While there may be more, here’s my top 5:

  1. Business continuity across regions: extreme HA needs (99.999% uptime)

    Just to remind everyone what 5 nines means, I will refer you to this message:

    26 seconds of downtime a month, that’s 312 seconds a year. Yes, 5.2 minutes a year.

    Now think about the cost of delivering that sort of reliability. I find this Wikipedia page surprisingly helpful in conveying how little time for maintenance and failures is left with enough nines added.

    Consider what it would take to absorb failures across data centers or cloud regions without rejecting writes or failing over manually. Active-active can help here because failover becomes instant and transparent; write traffic just shifts to surviving nodes.

    But again, the cost will match the ambition. Do you plan HA within the same server room with separate power and networking? Or are you aiming for full geographic separation, to stay online even during a country-wide outage? These decisions massively influence the architecture, and together with your uptime goals, they define the cost. At this level, every part of the solution should reflect real business needs, because every layer of complexity adds expense. You can’t overstate the value of planning and proper analysis when building systems like this.
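
The five-nines budget quoted in item 1 can be reproduced with simple arithmetic. The sketch below assumes a 365-day year and twelve equal months. Under those assumptions the yearly budget comes out at about 315 seconds, slightly above the rounded 312 seconds (26 × 12) quoted above.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; leap years ignored


def downtime_budget_seconds(availability_percent: float) -> float:
    """Seconds per year a service may be down and still meet the target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    per_year = downtime_budget_seconds(target)
    # Print the yearly budget in minutes and the monthly budget in seconds.
    print(f"{target}%: {per_year / 60:.1f} min/year, {per_year / 12:.1f} s/month")
```

Each added nine cuts the budget by a factor of ten, which is why the cost of the last nine is so much higher than the cost of the first three.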

  2. Write availability during regional failures

    If your business serves a global customer base and absolutely must accept writes in more than one region, for example, to maintain uptime guarantees or continue operating during a regional outage, then active-active might be the least painful of the painful options.

    This is not about low latency. This is about keeping write traffic flowing even when something breaks. That includes:

    All jokes aside, these are serious risks. If this kind of failure is unacceptable for your business, and you are willing to take on the operational weight and cost (we will get to that), active-active may be the right tool.

    But be honest about what you are solving. If your system demands strong consistency, every transaction still needs coordination across nodes. For example, if a user in Australia writes to a local node, and the other node is in the United States, that write still involves a round trip to the United States before it can commit. That round trip adds latency rather than removing it. While it may be 150-200ms on average for the Australia to USA round trip, it adds up with volume.

    The real benefit of active-active here is not performance. It is write availability during failure. If your business cannot afford to reject writes when a region goes dark, and you are prepared for everything else that comes with this decision, this might be one of the rare cases where active-active makes sense.

    Just be clear: what you are solving here is not distributed latency, but write continuity when something fails.
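
To see how the 150-200ms round trip from item 2 adds up with volume, here is a back-of-the-envelope sketch. It assumes, as a simplification, that every commit waits for exactly one cross-region round trip and that a session runs its transactions one after another; real systems may batch or pipeline, so treat the numbers as an upper bound on the pain rather than a measurement.

```python
def max_sequential_commits_per_second(round_trip_ms: float) -> float:
    """Upper bound for one session that waits a full round trip per commit."""
    return 1000.0 / round_trip_ms


def added_wait_seconds(transactions: int, round_trip_ms: float) -> float:
    """Total time spent waiting on the network for a run of commits."""
    return transactions * round_trip_ms / 1000.0


for rtt_ms in (150, 200):
    rate = max_sequential_commits_per_second(rtt_ms)
    wait = added_wait_seconds(10_000, rtt_ms)
    print(f"{rtt_ms} ms RTT: at most {rate:.1f} commits/s per session, "
          f"{wait:.0f} s of waiting per 10,000 commits")
```

At a 200ms round trip, a single session cannot exceed five commits per second under these assumptions, no matter how fast the local node is.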

  3. Migrating legacy architectures

    If you’re part of an organization moving away from systems like Oracle RAC or GoldenGate, where distributed write semantics were either built-in or at least promised, you may face business or political pressure to deliver “the same thing” on PostgreSQL.

    In these cases, active-active might be the shortest path to satisfying the checkbox. But it’s almost always a transitional compromise, not the destination. Like any compromise, it will not be entirely pleasant. The technically better (but less politically correct) move is usually to re-architect for clearer ownership of writes and better separation of concerns.

    If you can push for that path, do it. If not, be aware of the cost you’re inheriting.

  4. Application performance (not database performance)

    In the end, what you are really trying to improve is not the database throughput, but the end-user experience. Active-active may be worth considering not for improving database internals, but for reducing perceived latency in globally distributed apps or smoothing responsiveness during network transitions.

    In rare cases, this might justify active-active if the application can route users to their nearest region and issue local writes. But your app must be built for it. Deterministic conflict handling, idempotency, and careful session management are must-haves in such a case.

    If your database is fast, but the user still feels lag because the write travels halfway across the planet, active-active might help. But this should be a last resort, not a default choice.
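
Idempotency, one of the must-haves named in item 4, can be sketched as follows. This is an in-memory stand-in, not a PostgreSQL or pgactive API: the `RegionalStore` class and the idea of the client sending an `idempotency_key` with each request are assumptions made for illustration.

```python
class RegionalStore:
    """In-memory stand-in for one region's database (hypothetical)."""

    def __init__(self) -> None:
        self.balances = {}    # account name -> balance
        self.applied = set()  # idempotency keys that were already applied

    def apply(self, idempotency_key: str, account: str, delta: int) -> bool:
        """Apply a write once; return False when the request is a replay."""
        if idempotency_key in self.applied:
            return False
        self.balances[account] = self.balances.get(account, 0) + delta
        self.applied.add(idempotency_key)
        return True


store = RegionalStore()
store.apply("req-1", "alice", 50)
store.apply("req-1", "alice", 50)  # retry after a timeout or a region switch
print(store.balances["alice"])
```

The retry leaves the balance unchanged because the key was already recorded. In a real deployment the set of applied keys would itself have to be replicated, which is one more piece the application has to own.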

  5. Local HA in disconnected or semi-connected environments

    In edge computing, retail stores, ships, or military use cases, you might want each node to function independently to address intermittent connectivity. In such scenarios, you will still be able to write locally when the network is not available. When the network comes back, the changes are synced. While conflict avoidance may be the strategy you go for, in the end you will still pay the cost of conflict resolution.
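
Conflict avoidance for the semi-connected case in item 5 can be sketched as ownership by key: each node only writes rows it owns, so syncing after a reconnect has nothing to resolve. The key scheme (`"store-7:order-1"`), the function name, and the dictionaries standing in for databases are hypothetical.

```python
def merge_on_reconnect(central: dict, local_changes: dict, node_id: str) -> list:
    """Merge rows written offline; return keys that violate ownership."""
    rejected = []
    for key, row in local_changes.items():
        owner = key.split(":", 1)[0]  # the key prefix names the owning node
        if owner == node_id:
            central[key] = row        # only this node writes this key
        else:
            rejected.append(key)      # would need real conflict resolution
    return rejected


central = {"store-1:order-1": {"total": 10}}
offline = {"store-7:order-1": {"total": 25}, "store-1:order-1": {"total": 99}}
rejected = merge_on_reconnect(central, offline, "store-7")
print(rejected)
```

The write to a row owned by another node is the case avoidance cannot cover. It ends up in `rejected`, and something or someone still has to decide what happens to it.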

What’s next?

In the next blog post I will focus on the bad reasons to consider active-active replication and on the cost that should not be forgotten. Stay tuned!

Jan Wieremjewicz

Jan is a Senior Product Manager at Percona, leading the products for PostgreSQL. He has vast experience in the development, deployment and maintenance of enterprise systems.

Professionally, he is passionate about simple solutions that solve complicated problems and user experience that maximizes the product potential.

Privately, he is a foodie by day, a tech geek into graphic novels, video and board games by night, and a parent/spouse in between.

After almost three years of his time at Percona, most of us have learned that he has enough energy and topics to fill any space and time.
