At Subskribe, we solve pretty hard engineering problems. One such hard problem is mutual exclusion of invoice generation. This blog post is the story of how we achieved this without spending a lot of engineering hours by utilizing the technologies at our disposal.

If there is anything that I have learned over the past 15 years of my engineering career, it is that distributed computing is not an easy problem to tackle. 

Thankfully, several people have realized the same over the years and took the work by Dr. Leslie Lamport (Paxos) and turned it into scalable solutions like Apache Zookeeper, ETCD, and Chubby. These solutions are wonderful and will always be used to solve hard distributed computing problems. They require a good level of expertise to operate and maintain and usually require a team to support them when they grow really large.

But, sometimes, you have available existing tools that you can use, such as a fully functioning Postgres database. This blog post provides deeper details about how we leveraged Postgres to solve the problem of Distributed Mutual Execution for our invoice generation needs.

Problem Statement

We build mission critical business logic. The business logic gets quite complex, and one critical area of this business logic is invoice generation. There are several constraints on invoice generation, and one such constraint is, "at any given time you should generate only one invoice for a given subscription."

Let us see how this constraint can be violated in a distributed scenario. Imagine two machines M1 and M2 both get instructions to generate an invoice for a subscription S1. M1 wants to generate a usage-related invoice for S1, but M2 wants to generate a recurring-charges-related invoice for S1.

Invoice generation problem

If M1 is not aware of M2 and cannot coordinate, they may generate an Invoice for S1 at the same time, andvoilàwe have violated our constraint, "at any given time you should generate only one invoice for a given subscription."

Why Not Just SELECT FOR UPDATE?

Postgres already supports SELECT FOR UPDATE or SELECT FOR UPDATE NOWAIT, so why not simply use that? This is a fair argument and would be perfectly valid, but we can do something better and use the underlying mechanism that Postgres uses to acquire locks—enter Postgres Advisory Locks.

There are several advantages to using advisory locks over SELECT FOR UPDATE locking:

  • PostgreSQL provides a means for creating locks that have application-defined meanings. This allows you to create locks on items that are not stored in the DB and mean something only to the application (e.g., locking on an arbitrary key that is stored only in application memory).
  • The RAW power of locks is exposed in the advisory locks API, so you can achieve fairly sophisticated locking behavior.
  • Advisory locks can be session level (connection level) or transaction level, offering flexibility on what level one can lock.
  • The cost of an advisory lock when acquired in an optimistic manner (more details later) is trivial, and several thousand such optimistic locks can be acquired at any given time.

Advisory Locks API/Contract 

At Subskribe, we only use the optimistic variant (try to acquire lock and fail) of the advisory locks. Pessimistic locking (try to acquire lock but wait until you can or timeout) is, in general, not a good pattern, and we haven’t seen much use for it in our engineering needs.

API/Function Explanation
pg_try_advisory_lock (key bigint) Try to acquire a session lock on a key that is a big int, return true if success, false if lock cannot be acquired
pg_try_advisory_xact_lock (key bigint) Same contract as above but acquire a transaction lock on key rather than session lock

WARNING: If you acquire a session level lock from the application, it is the responsibility of the application to explicitly release that lock (otherwise the lock would be held). If you acquire a transaction level advisory lock, Postgres automatically releases the lock when the transaction ends.

NOTE: Advisory locks is a comprehensive topic in Postgres, and the entire list of functions can be found by visiting Admin Functions, and simply searching for "pg_advisory_lock".

How Did It Solve the Problem? 

At last, we can see how all of the theories above can actually solve the problem.

Subscription is a first-class entity on our platform. When executing a sales order in our platform it creates the subscription object with all future invoice generation hinged off of this subscription entity. The subscription entity is modified whenever we create new orders and amendments with several background jobs dependent on those mutations.

So we did not want to lock the subscription because then other operations on subscription would also have to wait. Invoice generation was a specific case where mutual exclusion was required, while the rest of the operations can proceed on a subscription. 

The insight we had was to lock the subscription ID  for the invoice generation context. Simply put, by locking the string “invoice_gen/SUB-1234”, then two competing processes could coordinate on this key and safeguard the invoice generation critical section. 

Here are the steps in execution:

Invoice generation steps
  • Time T1: M1 (machine one) tries to acquire a transaction lock on “invoice_gen/SUB-1234”, and successfully acquires the lock. 
  • Time T2: M1 proceeds to generate an invoice for S1. 
  • Time T2: M2 (machine two) tries to acquire a lock on “invoice_gen/SUB-1234” and FAILS because the lock is held by M1. 
  • Time T3: M2 optimistically fails and will possibly try later to acquire locks. 
  • Time T4: M1 finishes invoice generation for S1 and commits the transaction. 
  • Time T5: Postgres releases the advisory transaction lock on “invoice_gen/SUB-1234” because the transaction is over (application need not release the lock).

So what did we achieve? We managed to achieve distributed mutual exclusion using Postgres advisory locks using only an arbitrary key (which is not even stored in the database). 

Locking String Vs. Number 

The reason we lock a string like “invoice_gen/SUB-1234” is that is what is available from the application layer and corresponds to the entity key/identifier we are locking. The advisory locks, however, use a bigint key as the argument (we are locking an integer not a string).

To address this problem, we use the hashing library provided by Google Guava. We needed a non-cryptographic fast hash that had great hash distribution properties and also was 64-bit (since we needed to turn a string into a big int).

We settled on the Sip Hash. This is a lesser known but very useful hash function of the “add-rotate-xor” family, which is reasonably fast, has very good distribution properties, and a Guava implementation known to work well.

Show Me the Code

This is all well and good, but can you show me the code? Here is the entire listing of the helper class we wrote to make use of the Postgres advisory locks:

package com.subskribe.billy.postgres;

import com.google.common.hash.HashCode;
import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;
import com.subskribe.billy.validation.Validator;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import org.apache.commons.lang3.BooleanUtils;
import org.jooq.DSLContext;
import org.jooq.Record;
import org.jooq.Result;

/**
* utility class that uses postgres advisory locks
* to implement application level locking
*/
public class PostgresAdvisoryLock {

   // pg_try_advisory_xact_lock expects bigint as locking key
   private static final String PG_ADVISORY_XACT_LOCK = "SELECT pg_try_advisory_xact_lock(%d)";

   // I chose Sip Hash because it is a decently fast with good unique range
   // this hash function is used in Linux, Haskell and Rust hashtable implementations
   // it is a 64-bit hash and the range is java long
   private static final HashFunction SIP_HASH_FUNCTION = Hashing.sipHash24();

   /**
    * try to acquire an advisory postgres lock with the given String key.
    * the String is converted into a long using a 64 bit non-cryptographic hash
    * the lock is automatically released after the transaction is complete.
    * 
    * if the lock is successful it returns the hash code on which the lock was acquired
    * if the lock was unsuccessful an {@link Optional#empty()} is returned
    *
    * @param dslContext non null DSL context
    * @param key        non-empty lock key
    * @return Optional of hash code if successful {@link Optional#empty()} otherwise
    */
   public static Optional tryAcquireLock(DSLContext dslContext, String key) {
       // calculate hash
       HashCode hashCode = SIP_HASH_FUNCTION.hashBytes(key.getBytes(StandardCharsets.UTF_8));
       long hashKey = hashCode.asLong();

       // use the hash to try locking
       String lockQuery = String.format(PG_ADVISORY_XACT_LOCK, hashKey);
       Result lockResult = dslContext.fetch(lockQuery);

       // Query is guaranteed to return a singleton boolean in the result set.
       if (BooleanUtils.isTrue((Boolean) lockResult.get(0).get(0))) {
           return Optional.of(hashKey);
       }
       return Optional.empty();
   }
}

Conclusions

Upon completion of this project, we have several takeaways:

  • You do not always need expensive consensus management systems to solve problems like Distributed Mutual Execution. Just looking a little further into existing and available tools may lead to quicker working solutions and avoid costly delays. 
  • Postgres is a very versatile database and has a lot of bells and whistles that can prove useful. 
  • Defining the problem you want to solve clearly can lead to creative, focused solutions. 
  • We had to write very minimal code (the code in the post above) to achieve mutual distributed exclusion, which was a big win in terms of time to production.