Accelerating Development with Feature Flags

<p>At Subskribe, our velocity of feature development is high. Our development philosophy follows the model of Continuous Integration and Deployment (CI/CD). Ideally, when releasing a feature to a specific stage of our environment, we want to isolate that release from the release of any other feature. This enables us to make functionality available only when it has cleared our quality bar. To facilitate this isolation, we recently introduced Feature Flags into our environment, leveraging the <a href="https://docs.aws.amazon.com/appconfig/latest/userguide/what-is-appconfig.html" rel="nofollow">AppConfig service from AWS</a>.</p>

<p>The concept of Feature Flags has been extensively covered elsewhere, so we will not go into depth here. <a href="https://martinfowler.com/articles/feature-toggles.html" rel="nofollow">This article</a> is a good starting point. Note that it covers use cases more complex than what we are targeting at Subskribe. At a high level, Feature Flags provide the ability to dynamically enable and disable application functionality at runtime. This allows code to be deployed as part of our continuous release methodology and then toggled on or off in different environments.</p>

Design

AWS AppConfig

We considered a few options, including other commercial solutions as well as rolling our own from scratch. Ultimately we decided to use AWS AppConfig, for a number of reasons:

Our technology stack resides in AWS and already has dependencies on AWS services.
The fee structure for this service from AWS was favorable.
AppConfig supports the functionality we were looking for (and more). This includes:
1. dynamically switching flags
2. versioning
3. a UI and command line interface for making changes
4. externalizing the flag settings outside of our application
5. different deployment strategies
6. a straightforward API for our application to fetch flag settings.

Frontend and Backend Support

The Subskribe application uses Node.js, TypeScript, React, and GraphQL for its frontend and Java for its backend. Our approach therefore needs to support querying whether a flag is enabled both from TypeScript hosted in Node.js and from Java code running in the JVM.

While AWS AppConfig exposes an API that can be called from Node.js and TypeScript, we want our frontend and backend to have a consistent view of which features are enabled. Because of that, we decided to expose a GraphQL query from our Java backend which returns the flag values stored there. The backend handles all queries to the AppConfig API.

Data Fetching and Caching

Feature Flag settings are stored both locally in configuration files deployed with the Subskribe application as well as remotely in the AWS AppConfig service. This allows developers to change settings locally without needing to depend on AWS. That said, if a feature has a setting both within a local configuration file as well as in AWS AppConfig, the AWS setting takes precedence (whether enabled or disabled). This enables us to toggle a feature on or off irrespective of what configuration settings happen to get deployed with the application.

On startup, the Subskribe application queries AWS to download the latest flag values. Since we view Feature Flag settings as a necessary component for the correct functioning of the application, if this download is not successful (after a number of retries) we abort application startup. To manage this startup process we use the Failsafe Java Library.
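The fail-fast startup behavior can be sketched roughly as follows. This is only an illustration: the production code uses the Failsafe library for retries, whereas this sketch swaps in a hand-rolled loop so it has no dependencies, and it omits any delay between attempts. `StartupFlagLoader`, `loadOrAbort`, and the `fetch` supplier (standing in for the call to AWS) are hypothetical names.

```java
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical startup loader: retries the fetch a fixed number of times, then fails hard.
// A plain loop stands in for the Failsafe retry policy used in the real application.
final class StartupFlagLoader {
    private StartupFlagLoader() {}

    static Map<String, Boolean> loadOrAbort(Supplier<Map<String, Boolean>> fetch, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetch.get();
            } catch (RuntimeException e) {
                // Remember the failure and try again until attempts run out.
                lastFailure = e;
            }
        }
        // Every attempt failed: the caller is expected to let this propagate and stop startup.
        throw new IllegalStateException(
                "Could not load feature flags after " + maxAttempts + " attempts", lastFailure);
    }
}
```

Throwing instead of returning an empty map keeps the application from starting with an unknown set of flags.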

We cache the flag settings locally in a Guava LoadingCache with a fairly short TTL. This means that we fetch the latest configuration promptly while the application is exercising code paths that check a flag, but we do not fetch at all when it is not. This gives us a good balance between picking up the latest flag values and avoiding unnecessary AWS API calls (and the cost they incur).

Unlike at application start, once the application is up and running, if a call to AWS to fetch the Feature Flag configuration fails (after some retries) we simply log an error and return the old, cached value. Subsequent queries of the feature flag values from Java or UI code will trigger new attempts to fetch the configuration from AWS.
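The runtime behavior described above (lazy refresh on read, short TTL, stale value on failure) might look something like the sketch below. It is not the Guava-based implementation: a small hand-written class stands in for the LoadingCache so the example is dependency-free, and `CachedFlags`, its constructor parameters, and the injectable clock are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical stand-in for the Guava-backed cache: refreshes lazily on read
// and keeps serving the old value when a refresh fails.
final class CachedFlags {
    private final Supplier<Map<String, Boolean>> fetch;
    private final LongSupplier clockMillis;
    private final long ttlMillis;
    private Map<String, Boolean> cached;
    private long loadedAtMillis;

    CachedFlags(Map<String, Boolean> initial, Supplier<Map<String, Boolean>> fetch,
                long ttlMillis, LongSupplier clockMillis) {
        this.cached = initial;
        this.fetch = fetch;
        this.ttlMillis = ttlMillis;
        this.clockMillis = clockMillis;
        this.loadedAtMillis = clockMillis.getAsLong();
    }

    synchronized Map<String, Boolean> get() {
        long now = clockMillis.getAsLong();
        if (now - loadedAtMillis >= ttlMillis) {
            try {
                cached = fetch.get();
                loadedAtMillis = now;
            } catch (RuntimeException e) {
                // Log and keep the stale value. The load timestamp is left unchanged,
                // so the next read triggers another refresh attempt.
                System.err.println("Feature flag refresh failed: " + e.getMessage());
            }
        }
        return cached;
    }
}
```

Because a refresh only happens inside `get()`, no calls are made while nothing is reading flags, which matches the cost trade-off described earlier.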

Why do we require the flag values stored in AWS to be successfully read by the application on startup, rather than falling back to the config files? Because we view AWS AppConfig as the source of truth for this data. If we relied on the local config file values in the face of AWS download failures, we would either need to ensure the local config values kept getting updated (which would defeat the purpose of dynamic flag settings) or we would have to live with an inconsistent set of flags whenever we had a hiccup in contacting AWS on application startup.

API Design

While the AppConfig library has an easy-to-use API, we wanted something simpler for our backend and frontend developers to query. As such, we built a very simple Features class which provides an interface to query whether a specific feature is enabled, abstracting away calls to AWS as well as lookups to any internal configuration.

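import java.util.Map;

public class Features {
	// Each feature carries the key used to look it up in AWS AppConfig and in local configuration.
	public enum Feature {
		FEATURE_1("feature1"),
		FEATURE_2("feature2"),
		…;

		private final String key;

		Feature(String key) {
			this.key = key;
		}

		public String getKey() {
			return key;
		}
	}

	private final AWSAppConfigWrapper appConfig;

	// @Inject belongs on the constructor; it comes from whichever DI framework is in use.
	@Inject
	public Features(AWSAppConfigWrapper appConfig) {
		this.appConfig = appConfig;
	}

	public boolean isEnabled(Feature feature) {
		// A value present in AWS always wins, whether it is enabled or disabled.
		Map<String, Boolean> awsFeatures = appConfig.getFeatures();
		if (awsFeatures.containsKey(feature.getKey())) {
			return awsFeatures.get(feature.getKey());
		}

		// check local settings from the configuration file
		// and return that if present, otherwise just return false
		…
	}
}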

As an implementation detail, we created a wrapper class around AWS’s AppConfigDataClient that abstracts away the loading, caching, and fetching that was described above.

To query from application Java code:

class SomeClass {
	private final Features features;

	@Inject
	SomeClass(Features features) {
		this.features = features;
	}

	void someMethod() {
		if (features.isEnabled(Features.Feature.FEATURE_1)) {
			// handle the case where the feature is enabled
		}
	}
}

As noted above, we created a GraphQL query so these values can also be retrieved by our UI. The query has a simple definition which can be called using your favorite GraphQL client library:

query GetFeatureIsEnabled($feature: String!) {
…
}

which returns a boolean value.

The block diagram below provides an overview of the architecture we use in the Subskribe application.

Overview of architecture in Subskribe application

Deploying Feature Flag Updates

At Subskribe we deal with a lot of financial data and adhere to the SOC 2 standard. One of the controls we have in place is review of changes that go into production. We use GitHub for source control and leverage the PR system to have code reviewed before it is deployed. We wanted a similar level of validation on any Feature Flag deployments, but, in order to really realize their value, we also wanted their deployments to be lightweight, quick, and independent of any other deployment processes. We also wanted to minimize any tooling we needed to build.

We ended up building out simple YAML files, the format of which looks like:

environment.yaml:
feature1: enabled | disabled
feature2: enabled | disabled
…

Where environment is one of our deployment environments: dev, staging, or production. As you can see, each file contains the value for each Feature Flag, either enabled or disabled.
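To make the file format concrete, here is a minimal sketch of how such a file could be read and validated before being pushed anywhere. It assumes the files contain only flat `name: value` lines and deliberately avoids a YAML library; `FlagFileParser` is a hypothetical name and is not part of the pipeline described here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical validator for a flat "name: enabled|disabled" flag file.
final class FlagFileParser {
    private FlagFileParser() {}

    static Map<String, Boolean> parse(String content) {
        Map<String, Boolean> flags = new LinkedHashMap<>();
        for (String rawLine : content.split("\\R")) {
            String line = rawLine.trim();
            // Skip blank lines and comments.
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            int colon = line.indexOf(':');
            if (colon < 1) {
                throw new IllegalArgumentException("Expected 'name: value' but got: " + line);
            }
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            switch (value) {
                case "enabled":
                    flags.put(name, true);
                    break;
                case "disabled":
                    flags.put(name, false);
                    break;
                default:
                    // Reject anything other than the two allowed values.
                    throw new IllegalArgumentException("Invalid value for " + name + ": " + value);
            }
        }
        return flags;
    }
}
```

Rejecting unknown values up front means a typo in a flag file fails loudly instead of silently disabling a feature.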

Those files are stored in GitHub. A separate deployment pipeline listens to GitHub for updates to those YAML files; when a change is committed, it makes AWS API calls (via the AWS CLI) to push the changes to the appropriate environment in AWS.

Finally, we were able to leverage GitHub’s web editor for editing files in place in a repo. It allows us to change a file, create a branch, and open a PR with a single click. Thus we were able to avoid building out our own UI to manage Feature Flag updates.

The following diagram illustrates our different deployment flows.

Code and feature pipeline deployment flows

While our code deployment pipeline can take minutes or hours to work through (depending on various factors), our Feature Flag deployments complete within a few seconds.

Conclusion

We have been working with Feature Flags for a while. They have been successful in reducing the number of deployment issues we see when pushing functionality. When we have found an issue with a feature in one of our deployment environments, we have been able to quickly disable the offending functionality without needing to roll back or redeploy our code.