
Surviving Continuous Deployment in Distributed Systems

This article is also available as a talk, delivered at XConf Europe.

Introduction

This is an article about the day to day of software development.

The industry seems to be at a point where a lot of us are practising Trunk Based Development and/or Continuous Deployment, or are at least hassling our managers so we can work towards them. I’m a big fan myself.

But, as cool and shiny as these practices are, and as much as we reassure our fellow developers and stakeholders, I believe they do present some risks.

Not a lot is being said about how they affect the life of developers: as each change we make now goes immediately to production, it has the potential to affect a complex web of services. These services depend on each other in often intricate ways, and service interdependencies are probably among the hardest things to test. One wrong commit which our pipelines don’t catch, and production goes down. Or data is corrupted.

When deployments were slow and clunky and happened with a human pressing a button, we at least had the chance to explicitly set aside some time to think about all of the above, and (maybe) resolve these intricacies before going live. But now that everything goes live all the time, without the chance to look at it first, the same cognitive load needs to be spread over everyday coding. It doesn’t simply go away.

In this article, I want to share my approach to organizing this cognitive load: a framework to perform incremental, safe releases during everyday story work, which tries to reconcile our new one-commit = one-deploy developer workflow with the intricacies of distributed systems. This is a collection of mostly existing concepts and practices, but they are structured specifically around the challenges of CD.

Who this is for

This article was written with two different groups in mind:

  • Teams who are beginning to adopt Trunk Based Development and/or Continuous Deployment: once a team removes the very final gate to production, there is sometimes a “now what?” moment. Nobody is used to this way of working, and one might start feeling insecure about the code being pushed all of a sudden. Yet it is important to not break the stakeholders’ trust (and the application in front of the users, most importantly).
  • Programmers who have never used Trunk Based Development and/or Continuous Deployment, and are joining a mature team which adopted these practices a while ago. It can be very daunting to step into such a situation and suddenly, with little context on the application itself, have to check in work that might or might not break something.

Having witnessed both situations, I wished there had been more organised material and literature on the subject to use myself or to share with my colleagues. This aims to be a step in that direction, and to shift the conversation explicitly to Continuous Deployment, after we have been talking about Continuous Delivery for so long.

Disclaimer

This collection of practices is 1) very opinionated, as it is based on the author’s experience, and 2) by far not enough on its own to judge whether a team is ready to step into CD.

A healthy relationship with stakeholders, a robust TDD culture, a zero-downtime deployment strategy, infrastructure as code, a comprehensive suite of tests executed by a well-oiled pipeline…These are just some of the prerequisites I would make sure are in place before removing the gate to production.

Recap of the practices

Before diving in, let’s take a moment to refresh some definitions.

What is Trunk Based Development?

Trunk Based Development focuses on doing all work on Mainline (called “trunk” [...]), and thus avoiding any kind of long-lived branches.

What is Continuous Deployment?

Continuous Deployment means that every change on your mainline, or “trunk”, branch goes through the pipeline and automatically gets put into production.

The concepts themselves are quite simple. If you’re new to this and want to understand the subject a little bit better before continuing, you should check out Martin Fowler’s blog, the Continuous Delivery book and maybe this article.

But why should we immediately send every commit to production?

It sure seems scary, but the one-commit = one-deploy workflow can drastically improve team productivity for several reasons. It’s outside of the scope of this article to explain in-depth why these practices are a good idea, but I will still try to summarise what I think are the main advantages of using them together:

  • Immediate integration: there is immediate feedback on how new changes integrate with other team members’ code, but also on how they react to real production load, data, user behavior, third-party systems etc. This means that when a task is done it is really done and proven to work, not “done pending successful deployment”. No more surprises after the developers have forgotten what the code that just went live was even doing.
  • Smaller change deltas: each individual release becomes less risky, simply because there is less stuff being released at any given time. When the only difference between a version and the next is a few lines of code, then finding bugs becomes trivial. If a revert is needed, it is also going to impact a very small functionality rather than delay a bunch of unrelated features.
  • Developer friendliness: have you ever seen the git history of a repository where all the developers work on trunk? It’s straightforward. No branches, no guessing which commits are in which branch, no merge hell, no cherry-picking to fix merge hell; you get the gist. Also, all individual commits now have to be self-consistent enough to be release candidates, so there is no guessing what can be deployed where: ideally, any revision is able to survive in production on its own.

Example: Animal Shelter Management System

From this moment on, all of the concepts will be demonstrated using an example application.


We shall call it the “Animal Shelter Management System”, and it does exactly what the name says: allows animal shelter employees to manage every aspect of the life of their animal guests.

It even has a fancy, modern UI:

We will assume that this application is web-based and has a simple, idealized architecture, which might look very familiar to most developers. It has some sort of persistence – a relational database for our example – and a backend that acts as an API for a single-page application frontend. We will also assume the API collaborates with third-party systems to read or provide some data.

The source code to represent and deploy these components is split into two source control repositories. The first contains the backend code and the database evolutions, plus the infrastructure code to deploy them. In the second, we find all the frontend code for our single-page application and, again, its own infrastructure code.

Each of the repositories has its own pipeline which runs all the tests and eventually deploys to production.

We will also assume that this application is already live, and it is being enjoyed by hundreds of thousands of users.

The mindset shift

Contracts between distributed components

When talking about distributed systems, we are used to thinking about contracts and everything that goes with them in the context of our system (and our team) versus the outside world.

For example, our API (consumer) might rely on a third-party system to perform some task (producer), or vice versa. Every time we need a new feature or there is a change proposal, our developers will need to interface with the people responsible for the outside system.

These situations have been described extensively in the literature, and most organisations already have processes in place to deal with them, whether the third party is another team or a vendor.

We don’t think as explicitly about contracts, however, when a system we would normally consider a “unit” or a “microservice” has distributed sub-components within itself. We have distributed components the very moment any sort of inter-process communication happens (over the network, using files, pipes, sockets etc.).
Most non-trivial systems meet this definition.

Our Animal Shelter Management System is no exception, of course, having lots of obviously distributed bits and pieces: the persistence, API and UI are all talking to each other over the network.

And contracts certainly exist between them too: the UI consumes the API and expects it to respond in a certain way. The API, in turn, can be seen as the consumer of the persistence layer, as it relies on a certain schema being available.

Why does this matter?

Because any given task a developer starts can span multiple of these distributed sub-components. If we have split our stories right, in fact, the developers will deliver features end to end by design. This is different from when outward-facing communication is needed: as the components are all owned by the same team, there is no process and certainly no meetings around making sure all changes are backwards-compatible and happen in the right order between producer and consumer.

So how have developers been dealing with it?

Order of Deployment

From my experience, when deployment to production is not happening on every commit, dependencies between producers and consumers are usually managed by the developers themselves (assuming a cross-functional team) manually releasing each system’s changes in the correct order. This means that anyone who picks up a task can mindlessly start working on whichever codebase they wish, as all of their changes will wait to be released in the correct order by humans clicking buttons.

Or, we can say, code changes go live in their order of deployment.

See the example below of a change spanning backend and frontend, in which the developers can start with the frontend (which would not do anything useful without the underlying API) but then they can make sure to release in the reverse order.

Order of Development

Now imagine that the gate to production (and the human) have been removed. Suddenly changes don’t stop in some staging environment: they immediately get released.

Or, features go live in their order of development.

The same developers mindlessly deciding to start from one place or another would now release a UI change that has no business being in production before the API is finished.

The only way to avoid these situations (without any extra practices) is to make sure that the order of development is the same order in which code changes should be released.

This is a substantial change from the past: developers used to be able to pick up a new task and just start somewhere, maybe on the codebase they are most comfortable with or with the first change that comes to mind.
Now it’s not enough to know in advance which component needs changing and how: there also needs to be conscious planning of all the code changes going live, at the granularity of individual commits.
So, quite a lot of preparation is needed just to start typing.

In the following sections we’ll see how it’s possible to deal with (or remove) some of this planning overhead by following different approaches based on the type of code change being introduced:

  • New feature additions
  • Refactoring live features

Each of them has different implications.

Adding new features

Let’s begin with our first example. Imagine we have the following user story in our backlog:


As an animal shelter volunteer

I want to be reminded when animals need feeding

So that their diet can be on an accurate schedule


This implies adding a reminder section to our interface, with a navigation option to access it and an “add reminder” button.

Target state

Given these requirements, we will first imagine the target state of all the codebases that make up our system.

Frontend

The frontend needs an “Add Reminder” button which, when clicked, might call a remindMe() function. This function will call the backend with the information of our reminder. We will also need an extra <li> in the navbar’s <ul> to link to our new “Reminders” section.

const remindMe = () => {
   const url = `https://my-service-api/user/${userId}/reminder`

 fetch(url,{ method: 'POST', body: {/*details*/} })
  .then(success())
  .catch(error())
}
<!-- in the main navigation -->
<ul id="main-nav"> 
  <!-- ... other menu items ... -->
  <li id="reminders-nav">
    <a>My reminders</a>
  </li>
  <!-- ... other menu items ... -->
</ul>


<!-- somewhere in the page -->

<button id="reminders-btn" onclick="remindMe()">
 Remind me
</button>

Backend

The backend needs endpoints to create new reminders and see existing ones. They can be under an existing UserController since reminders are associated with a user.

@RestController
class UserController {

  ... //other user related endpoints

   @PostMapping("/user/{id}/reminder")
   Reminder newReminder(
        @PathVariable Long userId,
        @RequestBody ReminderPayload newReminder
        ) {
     return repository.save(newReminder, userId);
   }

   @GetMapping("/user/{id}/reminder")
    List<Reminder> allForUser(
      @PathVariable Long userId
      ) {
      return repository.findAllFor(userId);
    }
 
}

Persistence
Finally, we will need a table to persist the reminders.

Table "public.reminders"

Column    |  Type   | 
-------------+---------+
reminder_id | integer | 
identifier  | uuid    | 
user_id     | integer | 
interval    | integer | 

Indexes:
"reminders_pkey" PRIMARY KEY, btree (reminder_id)
Foreign-key constraints:
"reminders_user_id_fkey" FOREIGN KEY 
  (user_id) REFERENCES users(user_id)

How do we get there without breaking production?

Without Continuous Deployment we might simply start implementing from one of these three places and then continue working in no particular order. But, as we saw before, now that everything is going to production immediately we need to pay a bit more attention. If we wanted to avoid exposing broken functionality, we would need to start with the producer systems first, and gradually move upwards to the interface.

This allows every consumer to rely on the component underneath, with the UI eventually being released on top of a fully working system, ready for users to start creating reminders.

This approach certainly works – but is it the optimal way to approach new features? What if we prefer to work another way?

Outside-In

Many developers prefer to approach the application from the outside in when developing something new, especially if they are practising “outside-in” TDD and would love to write a failing end-to-end test first. Again, it is out of the scope of this article to explain all of the benefits of this practice, but here is a summary:

  • Starting from the layers visible to the user allows for early validation of the business requirements: if it is unclear what the visible effects of the feature should be from the outside, this step will reveal it immediately. Starting development becomes the last responsible moment for challenging badly written user stories.
  • The API of each layer is directly driven by its client (the layer above it), which makes designing each layer much simpler and less speculative. This reduces the risk of having to re-work components because we did not foresee how they would be invoked, or to add functionality that will end up not being used.
  • It works really well with the “mockist”, or London school, approach to TDD: implementing one component at a time, mocking all of its collaborators (see the sketch below).
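
For instance, a first outside-in test for the reminders story could look like the sketch below, assuming JUnit 5 and Mockito, and a hypothetical ReminderRepository collaborator (neither is prescribed by the original example):

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import org.mockito.InjectMocks;
import org.mockito.Mock;
import org.mockito.junit.jupiter.MockitoExtension;

import static org.mockito.Mockito.verify;

@ExtendWith(MockitoExtension.class)
class UserControllerTest {

  @Mock
  ReminderRepository repository; // the collaborator is mocked, its real implementation comes later

  @InjectMocks
  UserController controller;

  @Test
  void savesTheReminderForTheGivenUser() {
    ReminderPayload payload = new ReminderPayload();

    controller.newReminder(42L, payload);

    // we only verify the interaction with the layer below,
    // which is exactly what drives that layer's API
    verify(repository).save(payload, 42L);
  }
}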

But this implies that the ideal order for development is exactly the opposite of the order needed to avoid breaking the application!

In the following section, we will see a technique we can use to resolve this conflict.

Feature Toggles (or Feature Flags)

Feature Toggles are a technique often mentioned together with CI/CD. Quoting Pete Hodgson, a feature toggle is a flag that allows us to “ship alternative code paths within one deployable unit, and choose between them at runtime”.

In other words, they are boolean values that you can use in your code to decide whether to execute one branch of behaviour or another. Their value usually comes from some state external to the application (an S3 bucket, a parameter store, a database, a file somewhere, etc.) so that it can be changed independently of deployment.

In practice, creating a feature toggle means adding an if statement.

if (useNewAlgorithm) {
   return newWayOfCalculatingResult();
} else {
   return oldWayOfCalculatingResult();
}
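
As a minimal sketch of that external state, the flags could live in a plain properties file; the path and the reading mechanism here are just assumptions (real projects typically use a toggle library or a config service):

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

class FeatureToggles {

  boolean isEnabled(String toggleName) {
    Properties toggles = new Properties();
    try (FileInputStream file = new FileInputStream("/etc/my-application/toggles.properties")) {
      toggles.load(file);
      return Boolean.parseBoolean(toggles.getProperty(toggleName, "false"));
    } catch (IOException e) {
      // if the toggle state cannot be read, fall back to the old behaviour
      return false;
    }
  }
}

The if statement above then becomes something like if (featureToggles.isEnabled("USE_NEW_ALGORITHM")), and flipping the flag is just an edit to the external state, with no deployment involved.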

With Continuous Deployment, we can use them to decouple the order of changes needed to not break contracts from the order in which we want to develop. This is achieved by using the feature toggle to hide half baked functionality from the users, even if its code is in production.

Other feature toggle benefits

Feature toggles with more advanced configuration options can be used to do even fancier things: QA directly in production (only enable the toggle for specific user ids or a special header), A/B testing (only enable the toggle if the user session falls into the test group or the control group), gradual ramp-ups to release features slowly (toggles activated by percentage), and so on.

You can check out one of the many feature toggle libraries to see if they support these features.
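
To give an idea of what those options boil down to, here is a hand-rolled sketch of per-user targeting and a percentage ramp-up (the names and numbers are made up; in practice a library would provide this):

import java.util.Set;

class ReminderToggle {

  private final Set<String> qaUserIds = Set.of("qa-user-1", "qa-user-2");
  private final int rampUpPercentage = 10;

  boolean isEnabledFor(String userId) {
    if (qaUserIds.contains(userId)) {
      return true; // QA directly in production, but only for the selected users
    }
    // deterministic bucketing: the same user always lands in the same bucket,
    // so the feature doesn't flicker on and off between requests
    int bucket = Math.floorMod(userId.hashCode(), 100);
    return bucket < rampUpPercentage;
  }
}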

Implementing with a Feature Toggle

We can try to proceed outside-in by simply adding a feature toggle to the UI, like this:

Step 1: Adding the frontend code under a toggle
We can add the same code we would have in the target state, but with the addition of a toggle. This will hide the new elements (button and menu item) while the feature is still disabled. The code can go to production immediately, even without the API being ready, as the user won’t be able to click the button and get an ugly 404 error from the absent endpoints.

const remindMe = () => {
   const url = `https://my-service-api/user/${userId}/reminder`

 fetch(url,{ method: 'POST', body: {/*details*/} })
  .then(success())
  .catch(error())
}

const featureToggleState = //retrieve from somewhere

if (!featureToggleState.REMINDERS_ENABLED) {
  document.getElementById("reminders-nav").style.display = "none";
  document.getElementById("reminders-btn").style.display = "none";
}
<!-- in the main navigation -->
<ul id="main-nav"> 
  <!-- ... other menu items ... -->
  <li id="reminders-nav">
    <a>My reminders</a>
  </li>
  <!-- ... other menu items ... -->
</ul>


<!-- somewhere in the page -->

<button id="reminders-btn" onclick="remindMe()">
 Remind me
</button>

Step 2: Adding the endpoints
We can then add the endpoints to the API. They won’t work yet, as they rely on a table in the persistence layer that doesn’t exist. But, thanks to the toggle, it doesn’t matter.

@RestController
class UserController {

  ... //other user related endpoints

   @PostMapping("/user/{id}/reminder")
   Reminder newReminder(
        @PathVariable Long userId,
        @RequestBody ReminderPayload newReminder
        ) {
     return repository.save(newReminder, userId);
   }

   @GetMapping("/user/{id}/reminder")
    List<Reminder> allForUser(
      @PathVariable Long userId
      ) {
      return repository.findAllFor(userId);
    }
 
}

Step 3: Creating the table
Finally, we can get to the last layer and add the table we need, which will make the flow work end to end if we’ve done everything right.

Table "public.reminders"

Column    |  Type   | 
-------------+---------+
reminder_id | integer | 
identifier  | uuid    | 
user_id     | integer | 
interval    | integer | 

Indexes:
"reminders_pkey" PRIMARY KEY, btree (reminder_id)
Foreign-key constraints:
"reminders_user_id_fkey" FOREIGN KEY 
  (user_id) REFERENCES users(user_id)

Step 4: Toggling on
Once all necessary testing has been done, the feature toggle can be enabled. Once the feature is 100% live (and is there to stay) we can move on to the next step.

Step 5: Cleaning up the toggle
We can clean up the toggle and the code from our frontend codebase, finally reaching the target state we imagined for all of our components.

const remindMe = () => {
   const url = `https://my-service-api/user/${userId}/reminder`

 fetch(url,{ method: 'POST', body: {/*details*/} })
  .then(success())
  .catch(error())
}

//toggle removed! elements will not be hidden
<!-- in the main navigation -->
<ul id="main-nav"> 
  <!-- ... other menu items ... -->
  <li id="reminders-nav">
    <a>My reminders</a>
  </li>
  <!-- ... other menu items ... -->
</ul>


<!-- somewhere in the page -->

<button id="reminders-btn" onclick="remindMe()">
 Remind me
</button>

Summary

As we saw, a new feature can be added under a feature toggle, approaching the application from the outside in. This creates more overhead for developers, as they have to manage the lifecycle of the toggle and remember to clean it up once the feature is consolidated in production. However, it also frees them from having to worry about when to commit each component – and allows them to enable a feature in production independently of a deployment. Some brave teams even allow their stakeholders to enable toggles by themselves.
You can read more about feature toggles in Pete Hodgson’s article on martinfowler.com: https://martinfowler.com/articles/feature-toggles.html.

Refactoring

We will now focus on how to refactor across distributed systems whilst avoiding breaking existing features.
For our second example, we will imagine that we are changing the way we represent currency inside our system, without altering any functionality.

But why would we want to do that?


Imagine that one of our developers stumbles upon this fascinating article on Twitter:

So they learn that it is very dangerous to represent currency as a float – much better to store the full value in cents as an integer, and format it for the user later.
But suddenly they remember a certain feature in the Animal Shelter Management System… uh oh!

This is a feature that allows the shelter volunteers to record an expense for the animal’s food so they can keep track of their costs. It looks like a prime candidate for the Money Mistake™.

Upon closer inspection of the code, their fears are confirmed: the upper two layers of the application indeed use floats to represent monetary values.
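
To make the danger concrete, here is a tiny standalone demonstration (not part of the Animal Shelter codebase) of what floating point arithmetic does to money:

public class MoneyMistake {

  public static void main(String[] args) {
    // the classic rounding surprise
    System.out.println(0.1 + 0.2);            // prints 0.30000000000000004

    // converting a price to cents by multiplying and truncating silently loses a cent
    double price = 19.99;
    System.out.println((long) (price * 100)); // prints 1998, not 1999
  }
}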

Current State

Here’s how the application looks now.

Frontend

The frontend has a button that triggers an addExpense() function. Inside we can find the dreaded parseFloat(), which is used to parse the monetary value from the user input. The value is then sent to the backend with a POST to create the expense.

const addExpense = () => {
  const rawAmount = document
    .getElementById("expense-input")
    .value
  const amount = parseFloat(rawAmount)
    
  fetch(`https://my-service-api/animals/${animalId}/expense`,
   { 
     method: 'POST', 
     body: { amount: amount } 
   })
  .then(success())
  .catch(error())
}
<!-- button -->
<button class="expense-modal-open">
   Add Expense
</button>


<!-- modal -->

<div id="expense-modal">
  <input 
    type="text" 
    id="expense-input"
    placeholder="Enter amount in $"
  >
  <button onclick="addExpense()">
    Add expense
  </button>
</div>

Backend

The backend endpoint reads the value as a float from the payload, and passes it around as a float in all of its classes. Thankfully, it converts it to an integer before persisting it: the amount gets multiplied by 100 inside the toIntCents function, so that it can be saved as cents (as it always should be).

@RestController
class AnimalController {

   @PostMapping("/animals/{id}/expense")
   Expense newExpense(
        @PathVariable Long animalId,
        @RequestBody ExpensePayload newExpense
        ) {
     return repository.save(newExpense, animalId);
   }
 
}

class ExpensePayload {
  private Float amount;
}

class AnimalRepository {

    public Expense save(
      ExpensePayload payload, 
      String animalId) {
        // ...
        ResultSet resultSet = statement.executeQuery(
            "INSERT INTO expenses (animal_id, amount) " +
            "VALUES (..., " + toIntCents(amount) + ")"
        );
        return toExpense(resultSet);
    }

    private Integer toIntCents(Float amount) {
        return Math.round(amount * 100);
    }

}

Persistence
Finally, our persistence stores the amount in cents as a bigint type.

Table "public.expenses"

Column    |            Type             | 
--------------+-----------------------------+
expense_id   | integer                     |
identifier   | uuid                        |
animal_id    | integer                     |
created_date | timestamp                   |
amount       | bigint                      |

Indexes:
"expenses_pkey" PRIMARY KEY, btree (expense_id)
Foreign-key constraints:
"expenses_animal_id_fkey" FOREIGN KEY 
  (animal_id) REFERENCES animals(animal_id)

We want to refactor our application so that all layers (not just the persistence) handle the amount as an integer that represents the cents value.

In the next section, we will see the target state of the system.

Target State

Frontend
Our goal is for the frontend to immediately parse the value entered by the user as cents (multiplying by 100), and pass it to the API.

const addExpense = () => {
 const rawAmount = document
   .getElementById("expense-input")
   .value
 const amountInCents = parseInt(rawAmount.replace('.', ''))
   
 fetch(`https://my-service-api/animals/${animalId}/expense`,
  { 
    method: 'POST', 
    body: { amount: amountInCents } 
  })
 .then(success())
 .catch(error())
}
<!-- button -->
<button class="expense-modal-open">
   Add Expense
</button>


<!-- modal -->

<div id="expense-modal">
  <input 
    type="text" 
    id="expense-input"
    placeholder="Enter amount in $"
  >
  <button onclick="addExpense()">
    Add expense
  </button>
</div>

Backend

The backend should read the correct value from the payload and save it to the persistence layer as is, without needing to modify it (it’s already in the correct format).

@RestController
class AnimalController {

   @PostMapping("/animals/{id}/expense")
   Expense newExpense(
        @PathVariable Long animalId,
        @RequestBody ExpensePayload newExpense
        ) {
     return repository.save(newExpense, animalId);
   }
 
}

class ExpensePayload {
  private Integer amountInCents; 
}

class AnimalRepository {

    public Expense save(
      ExpensePayload payload, 
      String animalId) {
        // ...
        ResultSet resultSet = statement.executeQuery(
            "INSERT INTO expenses (animal_id, amount) " +
            "VALUES (..., " + amountInCents + ")"
        );
        return toExpense(resultSet);
    }
  
}

How do we get there without breaking production?

This is different from our last example: there is no new feature to be kept hidden from the users – instead, the feature is already live (and there are no new interfaces to discover, as all the code is well known). So feature toggles are probably not the right approach here: most of their benefits are diminished, while their overhead remains.
We can then take a step back and ask ourselves if there is an order in which we can release our changes without breaking anything.

Releasing the frontend code first will result in errors from the backend, which is still expecting to be called with floats.

But releasing the backend code first will end up similarly, with the frontend still sending the old, now incompatible format.

Whichever order we choose, it is clear that the functionality will be broken in front of the users for some time unless we take special precautions to avoid that.

Expand and Contract (or Parallel Change)

Expand and contract is a technique that allows changing the shape of a contract while preserving functionality, without even a temporary degradation of the feature.
It is frequently mentioned in the context of code-level refactoring (changing interfaces between classes), but it works even better in the context of distributed systems that depend on each other.
It consists of three steps (sketched at class level right after the list below):

  • Expand phase: in this phase we create the new logic in the producer systems under a separate interface that their consumers can use, without removing or breaking the old one.
  • Migrate phase: all consumers then are migrated across to use the new interface.
  • Contract phase: once all the consumers are using the new flow, the old one can finally be removed.
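
At class level, the three phases can be sketched roughly like this (the class and method names are made up for illustration):

// Expand: the new interface is added next to the old one, which now delegates to it
class ExpenseRecorder {

  @Deprecated
  void recordExpense(float amount) {               // old interface, still working
    recordExpense(Math.round(amount * 100));       // delegates to the new one
  }

  void recordExpense(long amountInCents) {         // new interface
    // ... actual logic ...
  }
}

// Migrate: callers are moved, one by one, to recordExpense(long amountInCents).
// Contract: once nothing calls it any more, recordExpense(float) is deleted.

The distributed version we are about to apply works exactly the same way, only the "method" being expanded is an HTTP contract.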

Inside-Out

With the expand and contract approach, we have to start by expanding the producer systems and then migrate the consumers. This means starting with our innermost layers and working our way out to the ultimate client (the UI code).

Notice that this is the opposite of the direction we took in the previous section.

We can try to use this to resolve the dependencies in our money example.

Implementing with Expand and Contract

We can proceed by starting with the system that needs to be expanded: our backend (the producer).

Step 1: Expand phase
We can make the backend work entirely with integers, as long as its interface supports both integers and floats. In our case, we can have the controller try to guess whether the client is sending a float amount and, if so, convert it to cents. If the amount is already in cents, it does nothing.


(In some other cases with HTTP APIs, supporting two interfaces in the same endpoint becomes so complex that it’s easier to just make a different endpoint, but we won’t do it for our example).
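
If we did go down that route instead, the expand phase would simply add a second endpoint next to the old one – the /v2 path below is purely illustrative:

@RestController
class AnimalController {

   // new interface, accepting cents directly; the old /animals/{animalId}/expense
   // endpoint would stay untouched until every client has migrated, then be removed
   @PostMapping("/v2/animals/{animalId}/expense")
   Expense newExpenseInCents(
        @PathVariable Long animalId,
        @RequestBody ExpensePayload newExpense // already in cents
        ) {
     return repository.save(newExpense, animalId);
   }

}

In our example, though, the single endpoint below guesses the format instead: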

@RestController
class AnimalController {

   @PostMapping("/animals/{animalId}/expense")
   Expense newExpense(
        @PathVariable Long animalId,
        @RequestBody Map<String, Object> body
        ) {
     String rawAmount = String.valueOf(body.get("amount"));
     ExpensePayload newExpense = isFloatFormat(rawAmount)
        ? new ExpensePayload(toIntCents(rawAmount))
        : new ExpensePayload(Integer.valueOf(rawAmount));
     return repository.save(newExpense, animalId);
   }

   private boolean isFloatFormat(String rawAmount) {
     return rawAmount.contains(".");
   }

   private Integer toIntCents(String rawAmount) {
     return Math.round(Float.parseFloat(rawAmount) * 100);
   }

}

class ExpensePayload {
  private Integer amountInCents; 
}

class AnimalRepository {

    public Expense save(
      ExpensePayload payload, 
      String animalId) {
        // ...
        ResultSet resultSet = statement.executeQuery(
            "INSERT INTO expenses (animal_id, amount) " +
            "VALUES (..., " + amountInCents + ")"
        );
        return toExpense(resultSet);
    }
  
}

Step 2: Migrate phase
The frontend can now be changed to its target state, where it sends cents instead of floats. This leaves the old flow on the API side unused.

const addExpense = () => {
 const rawAmount = document
   .getElementById("expense-input")
   .value
 const amountInCents = parseInt(rawAmount.replace('.', ''))
   
 fetch(`https://my-service-api/animals/${animalId}/expense`,
  { 
    method: 'POST', 
    body: { amount: amountInCents } 
  })
 .then(success())
 .catch(error())
}
<!-- button -->
<button class="expense-modal-open">
   Add Expense
</button>


<!-- modal -->

<div id="expense-modal">
  <input 
    type="text" 
    id="expense-input"
    placeholder="Enter amount in $"
  >
  <button onclick="addExpense()">
    Add expense
  </button>
</div>

Step 3: Cleanup phase
Once the frontend code is migrated, we can remove the old flow and the boilerplate code from the API, which reaches its target state as well.

@RestController
class AnimalController {

   @PostMapping("/animals/{id}/expense")
   Expense newExpense(
        @PathVariable Long animalId,
        @RequestBody ExpensePayload newExpense
        ) {
     return repository.save(newExpense, animalId);
   }
 
}

class ExpensePayload {
  private Integer amountInCents; 
}

class AnimalRepository {

    public Expense save(
      ExpensePayload payload, 
      String animalId) {
        // ...
        ResultSet resultSet = statement.executeQuery(
            "INSERT INTO expenses (animal_id, amount) " +
            "VALUES (..., " + amountInCents + ")"
        );
        return toExpense(resultSet);
    }
  
}

Summary

Existing features should be refactored with the expand and contract pattern, approaching from the inside out. This is an alternative to using feature toggles for day-to-day refactoring, which are costly and require cleanup.
Of course, there might be some rare exceptions where the refactoring we are performing is especially risky, and we would like to still use a toggle to be able to switch the new flow off immediately, independently of deployment.
Such situations, however, should not be the norm, and the team should question whether it is possible to take smaller steps whenever they arise.

This example focused specifically on the contract between frontend and backend within our system, but the same pattern can be applied between any two distributed components that need to alter the shape of their contract (and with both synchronous and asynchronous communication).

Special precautions should be taken, however, when the exclusive job of one of those components is to persist state, as we will see in the next section.

You can read more about the expand and contract pattern here: https://martinfowler.com/bliki/ParallelChange.html

Data and data loss

The attentive reader might have noticed what a lucky coincidence it was that the persistence layer was missing from our previous example, and how tricky it could have been to deal with. But programmers in the wild are seldom that lucky. That layer was conveniently left out because it deserves its own section: we will now address the heart of that trickiness and see how to refactor the database layer without any data loss.

First, let’s explore how the same money example looks when all three layers suffer from the issue, and what target state we should aim for.

Current State

Frontend
The frontend doesn’t change: it’s sending floats just like before.

Backend
This time the backend does not convert the float amount to cents before persisting it, as the database schema now requires floats too.

@RestController
class AnimalController {

   @PostMapping("/animals/{id}/expense")
   Expense newExpense(
        @PathVariable Long animalId,
        @RequestBody ExpensePayload newExpense
        ) {
     return repository.save(newExpense, animalId);
   }
 
}

class ExpensePayload {
  private Float amount;
}

class AnimalRepository {

    public Expense save(
      ExpensePayload payload, 
      String animalId) {
        // ...
        ResultSet resultSet = statement.executeQuery(
            "INSERT INTO expenses (animal_id, amount) " +
            "VALUES (..., " + amount + ")"
        );
        return toExpense(resultSet);
    }
  
}

Persistence
And here is the database table with the incorrect decimal datatype.

Table "public.expenses"

Column    |            Type             | 
--------------+-----------------------------+
expense_id   | integer                     |
identifier   | uuid                        |
animal_id    | integer                     |
created_date | timestamp                   |
amount       | decimal                     |

Indexes:
"expenses_pkey" PRIMARY KEY, btree (expense_id)
Foreign-key constraints:
"expenses_animal_id_fkey" FOREIGN KEY 
 (animal_id) REFERENCES animals(animal_id)

Again, let’s see the final state we imagine for the system once the problem is solved in all the layers.

Target state


Backend
This time our target state will be what we started from in the last example: the backend converting and persisting the expense amount as cents (even if it is not receiving cents from the frontend yet).

@RestController
class AnimalController {

   @PostMapping("/animals/{id}/expense")
   Expense newExpense(
        @PathVariable Long animalId,
        @RequestBody ExpensePayload newExpense
        ) {
     return repository.save(newExpense, animalId);
   }
 
}

class ExpensePayload {
  private Float amount;
}

class AnimalRepository {

    public Expense save(
      ExpensePayload payload, 
      String animalId) {
        // ...
        ResultSet resultSet = statement.executeQuery(
            "INSERT INTO expenses (animal_id, amount) " +
            "VALUES (..., " + toIntCents(amount) + ")"
        );
        return toExpense(resultSet);
    }
  
}

Persistence
Similarly, the target state of the persistence layer will be what we could take for granted in the last example: currency being stored as a bigint type. We will need a database evolution to convert the type of the column and the existing data (multiply by 100 to obtain cents value).

ALTER TABLE expenses
ALTER COLUMN amount TYPE bigint 
USING (amount * 100)::bigint;
Table "public.expenses"

Column    |            Type             | 
--------------+-----------------------------+
expense_id   | integer                     |
identifier   | uuid                        |
animal_id    | integer                     |
created_date | timestamp                   |
amount       | bigint                      |

Indexes:
"expenses_pkey" PRIMARY KEY, btree (expense_id)
Foreign-key constraints:
"expenses_animal_id_fkey" FOREIGN KEY 
  (animal_id) REFERENCES animals(animal_id)


How do we get there without breaking production?

One change per repository?

On many occasions, I have seen the persistence code being kept in the same source control repository as the backend code. Our example is no exception.

With such a setup, it might be tempting to add the database evolution and create the backend code that relies on the new schema shape in the same commit.


However, just because two changes live in the same repo doesn’t mean they don’t affect different components, and it doesn’t mean they will be released simultaneously. In any given pipeline, the database changes will be deployed in a separate step from the application code.

If the database evolutions are applied first, for example, our application will still attempt to save the old format to the database until its new version is deployed. This will lead to a brief period of failed requests and data loss.

The same is true when the deployments happen in the opposite order. Therefore, we can conclude that we should isolate changes belonging to different distributed components into separate releases, even though their codebases might be versioned together.

Can we apply expand and contract?

It might also be tempting to simply apply the expand and contract pattern from the last section (we are dealing with refactoring existing functionality, after all). We could imagine the expand and contract phases to look something like this:

  • Expand phase: expand our schema by creating another column amount_cents. Copy and convert all existing data to it. Old clients still write to old column amount and will need to be migrated
  • Migration phase: migrate all clients to write to new column amount_cents
  • Contract phase: finally remove the old column amount

However, this will also cause data loss: nothing is being written to the new column between the expand and migrate phases.

As we can see in the picture, there will be a gap in our new column for all the rows written in between the phases. Once the application starts using the new column, it will potentially return empty results or throw exceptions when retrieving data from that time window.

How can we avoid data loss then?

In the book “Refactoring Databases”, Scott W. Ambler and Pramod J. Sadalage suggest relying on a database trigger to prevent this sort of scenario.

This would indeed start synchronizing old and new columns from the moment the new column is born. However, if like this author you’re not exactly thrilled to be implementing important logic in SQL (and just generally shiver at the thought of database triggers), you might find the next section more interesting…

Pre-Emptive Double Write

We can make a little addition to our existing expand and contract pattern: before starting, we can change the application to attempt to write to both columns.

The column amount_cents will not exist yet, but we will code the application in a way that tolerates a failure when writing to it. Then we can proceed with the steps we had originally planned:

  • Expand phase: expand our schema by creating another column amount_cents. Copy and convert all existing data to it. Old clients still write to old column amount and will need to be migrated
  • Migration phase: migrate all clients to write to new column amount_cents
  • Contract phase: finally remove the old column amount

This will ensure that the very second the amount_cents column is created, data will start being written to it successfully, removing the gap we observed in the previous section.

Implementing with Pre-Emptive Double Write

Step 1: Double Write
We first need to change the backend so that it tries to persist in both formats. Notice the try/catch block around the attempt to write to our new column, as we need to tolerate it not existing yet.

class AnimalRepository {

  public Expense save(
    ExpensePayload payload, 
    String animalId) {
      // ...
      // the old write stays untouched and always succeeds
      ResultSet resultSet = statement.executeQuery(
        "INSERT INTO expenses (animal_id, amount) " +
        "VALUES (..., " + amount + ")");

      try {
        // additionally write the cents value to the new column for the same row,
        // tolerating a failure while the column does not exist yet
        statement.executeUpdate(
          "UPDATE expenses SET amount_cents = " + Math.round(amount * 100) +
          " WHERE expense_id = ...");
      } catch (Exception e) {
        // tolerate failure
      }

      return toExpense(resultSet);
  }

}

Step 2: Expand
We can now create the new column and copy all the existing data into it with a database evolution. As soon as this runs, the column will start being populated with new data by the code above (without any gap).

ALTER TABLE expenses
ADD COLUMN amount_cents bigint;

UPDATE expenses
SET amount_cents = (amount * 100)::bigint;
Table "public.expenses"

Column    |            Type             | 
--------------+-----------------------------+
expense_id   | integer                     |
identifier   | uuid                        |
animal_id    | integer                     |
created_date | timestamp                   |
amount       | decimal                     |
amount_cents | bigint                      |

Indexes:
"expenses_pkey" PRIMARY KEY, btree (expense_id)
Foreign-key constraints:
"expenses_animal_id_fkey" FOREIGN KEY 
  (animal_id) REFERENCES animals(animal_id)

Step 3: Migrate
We can now migrate the backend to write to and read from the new column (and remove the now redundant try/catch too).

class AnimalRepository {

  public Expense save(
    ExpensePayload payload, 
    String animalId) {
     // ...
     ResultSet resultSet = statement.executeQuery(
       "INSERT INTO expenses (animal_id, amount_cents) " +
       "VALUES (..., " + Math.round(amount * 100) + ")");
     return toExpense(resultSet);
  }

}

Step 4: Contract
We can add another database evolution to get rid of the old column, finally reaching our target state for both the persistence and the backend.

ALTER TABLE expenses
DROP COLUMN amount;
Table "public.expenses"

Column    |            Type             | 
--------------+-----------------------------+
expense_id   | integer                     |
identifier   | uuid                        |
animal_id    | integer                     |
created_date | timestamp                   |
amount_cents | bigint                      |

Indexes:
"expenses_pkey" PRIMARY KEY, btree (expense_id)
Foreign-key constraints:
"expenses_animal_id_fkey" FOREIGN KEY 
  (animal_id) REFERENCES animals(animal_id)

Notice that now we are in the same situation we were in with our previous example: the contract between backend and persistence is based on cents (integers), but the one between frontend and backend is still based on floats.
We can go back to the “Refactoring” section and apply expand and contract between backend and frontend if we want to complete the fix.

Summary

We can safely apply expand and contract when the database is involved by using the pre-emptive double write technique.
In the money example, we reached the target state without causing any data loss or dropped transactions. However, four releases were necessary to achieve this: an example of the overhead introduced by CD.

That said, not all applications have such a strict no-data-loss requirement. It is important to check with the stakeholders if and when data loss is acceptable, based on the nature of our software.

A note on NoSQL databases

Just because the database management system doesn’t enforce a strict schema on the data, it doesn’t mean that applications don’t rely on the objects they retrieve having a certain shape.

Even if you are using MongoDB, Redis, DynamoDB, or just files… all of the steps above can still apply. You should always be careful about what your code expects of any state stored in the outside world.
Migrating it, however, might be a little trickier than in our SQL example.
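
As a sketch of what that means in practice, a reader of a document store may have to tolerate both the old and the new shape of a document during the migration window (the field names below are just illustrative):

import java.util.Map;

class ExpenseDocumentMapper {

  // Documents written before the migration only carry "amount" (dollars as a float);
  // documents written after it carry "amountCents". Accepting both shapes here plays
  // the same role the expand phase plays for a relational schema.
  long amountInCents(Map<String, Object> document) {
    Object cents = document.get("amountCents");
    if (cents instanceof Number) {
      return ((Number) cents).longValue();
    }
    Object legacyAmount = document.get("amount");
    return Math.round(((Number) legacyAmount).doubleValue() * 100);
  }
}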

Bringing it all together: making a Story Plan

We have seen how we might approach any given task in a Continuously Deployed system depending on the nature of the change:

  • approaching from the outside in when implementing a new feature (making sure to hide our work in progress with a feature toggle)
  • approaching from the inside out when altering already live functionality (applying expand and contract to respect each contract, and taking special care about data loss)

But in the real world, things are not so clear cut, and sometimes a task can be a bit of a mess of adding something new and changing something existing.

Let’s imagine the following user story for our last example:


As an animal shelter volunteer

I want to be able to specify which type of expense I am recording

So that I can build more accurate expense reports


Which would require adding a “type” dropdown in our well-known expense functionality.

It definitely requires changing the shape of something existing: expenses will now have a type (the existing ones could get a default of “food”). But it is also new functionality, as it allows the user to specify that type, and there is definitely a visual change that might need to be hidden.
So which approach do we choose here? In which direction do we start?

Pre-Refactoring the system

We can apply the practice of “preparatory refactoring” to get the system into a state where adding the feature becomes a trivial change.

Whenever we get a task whose nature is mixed, or unclear, we can approach it by grouping all the changes that do not have any visible effect on the user, so we can address them at the beginning with our expand and contract workflow (for example, adding fields with default values or stretching existing abstractions), and leave the feature addition to the very end.

This not only gives us a framework to work with, but also reduces the code that will end up under a feature toggle (and therefore the risk of the release!) to a minimum.

The totality of the steps and commits we plan to achieve this can constitute our Story Plan.

Making a Story Plan

The final plan can be a diagram of which direction to follow, combined with a list of minimum code releases to safely follow the practices.

In our expenses types example, it might look something like this:

In the author’s experience, this should remain very informal during development: a scribble on some post-it notes or a notebook will probably suffice. The purpose of the exercise is to put ourselves in the good habit of taking dependencies explicitly into consideration at the beginning of a task, instead of diving headfirst into the code.

Conclusions

With what we have talked about so far, we can summarize four principles for practising Continuous Deployment safely:

  • Add new features outside-in, hiding the work in progress behind a feature toggle until it is complete.
  • Refactor live functionality inside-out, using expand and contract to keep every contract working at every commit.
  • Take special care of state: release database changes separately from application changes, and double write pre-emptively to avoid data loss.
  • Plan the order of commits and releases (a Story Plan) before starting to type.

However, if I had to leave the reader with just one thought, it would be this: when every commit goes to production, we cannot afford not to know what the impact of our code is going to be once it is there, at every step of its readiness lifecycle, even the intermediate ones. Starting with only a vague picture of the target state of the codebase(s) is not enough anymore: we must spend the time investigating or spiking out our changes to map out how we are going to send them live (even without necessarily agreeing with all the ways of working described in this article).

In short: as amazing and liberating as CD might be compared to older ways of working, it also forces us to hold ourselves and our peers accountable to an even higher standard of professionalism and deliberateness over the code and tests we are checking in. Our users are always just a few minutes away from the latest version of our code, after all.

I hope this guide can be useful to even a couple of people considering adopting Continuous Deployment (or struggling with it). Feel free to send feedback in the comments or through any other private channel.


An Isolated Developer Setup with Docker

In this post I am going to propose a setup to run any kind of application on a developer laptop in complete isolation. It is based on packaging the application into a Docker container and convincing it that it is still talking to the real world, while we’re actually mocking everything around it (spoiler: using more Docker containers).


I have used this in projects of various sizes – from small-scale services to really chunky applications with lots of intricate dependencies – and it has generally proven itself to be worth the initial investment.

In this guide I will assume that the reader is starting from scratch, with an application that is run locally just from their IDE, with no real automation or containerisation. Feel free to skip any of the steps if they are redundant or do not apply to your situation.
I will also assume the reader is familiar with basic concepts of Docker and networking.

But – why?

I have been on quite a few projects (even in really mature organisations) where any developer who wanted to look at their changes locally had to go through a set of annoying steps, or give up on running the application on their laptop completely. These steps may involve connecting to some VPN to reach other teams’ services, setting up authentication to the cloud provider so the application can connect to some resource it depends on, setting up some data somewhere for the local copy of the application to use – sometimes an entire test environment is even provisioned in the cloud just for this!

The premise of this post is that, based on my observations in those projects, such a situation is not sustainable in a modern software development team, and it negatively impacts developer productivity for the following reasons:

  • As much as we love our automated test suite and we are confident it will catch any regression, the ability to see features working end to end on a running copy of the application will always be needed for the developers’ peace of mind.
    If doing so on their laptops becomes annoying, humans will tend to follow the path of least resistance and verify their changes on some test environment in the cloud instead. This means more potentially broken revisions in the pipeline, and a strong urge to “close the gate” to check that everything looks as it should before production. All of which is a strong contradiction to the principles of Continuous Integration and Continuous Delivery we know and love.
  • Having dependencies to real world services is asking for brittleness, even if just for manual testing. There is no guarantee those services will be up, reachable and return an appropriate response.
  • As an extension of the previous point: even when third party services do return an appropriate response, it is not necessarily the one needed to be able to test the feature under development (or its edge cases). The data and messages coming into the system are out of our control. Our tests will be unreliable and at best cover only a happy path.
  • Every developer’s laptop is a potential snowflake in terms of tools and libraries installed, even which operating system it’s running. This is a slippery slope to everyone having a slightly different and not easily reproducible setup. Onboarding new people into the team takes longer and unpleasant “it works on my machine” experiences happen more often. Dev/production parity is also more distant.

Given all of this, it makes sense that a good developer setup should follow the principles of good Unit Tests: it should be runnable with one command, fully in control of everything interacting with the system under test, and should be just as fine running on a developer’s laptop on a plane.

What follows are the steps I recommend to arrive at such a setup, leveraging Docker containers. At the end of this guide you should be able to bring up a whole cluster of them with one command:

./run.sh

Note: as we are creating a system which is completely independent and closed off from the outside world, we might find that the things we mock change without us realising it. That’s why it’s very important to have contract tests in place for all of our dependencies.
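
As a sketch of what such a contract test could look like – run in the pipeline against the real dependency, not inside the isolated local setup – the URL and the asserted field below are assumptions, not part of any real service:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertTrue;

class ThirdPartyContractTest {

  @Test
  void responseStillHasTheFieldsOurMockPretendsToReturn() throws Exception {
    HttpRequest request = HttpRequest.newBuilder(
        URI.create("https://third-party.example.com/animals/42")).build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());

    // if the real payload drifts away from what our mocks assume, this fails loudly
    assertTrue(response.body().contains("\"name\""));
  }
}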


Reproducing your deployable artifact

The first step is to identify how our application is currently deployed in a real production environment (as opposed to how one might like to run it from their IDE, for example).
This happens with the creation of an artifact, which is usually a bundle of our executable (or source code, if we are using an interpreted language) and all of its dependencies, so that it can be moved into its desired location on the server.
This could be a .jar or .war file, an entire directory, a binary file, and so on.

What we want to do is automate the creation of that artifact locally, following how it is done for production as closely as possible. Existing pipeline code, if there is any, is a great place to look for this.
We can automate this as the first step of our ./run.sh script, which we will enrich until it’s able to run the whole application and its cluster of dependencies.

I will use a Java application which is built into a .jar file by a Gradle task as an example:

#!/bin/bash

echo "Building artifact..."

./gradlew clean assemble 

This will create a file build/libs/my-application.jar given a build.gradle file containing these instructions for the archive name

java {
    archivesBaseName = 'my-application'
}

Deploying the artifact in Docker

Once we have the artifact created, we can define a Docker image to wrap it, so that it can be independent of our local machine configuration and run as it would on a production server.
We will do so by creating a Dockerfile which will:

  • Download the base operating system image that the application should run on (or the most similar one we can find on the DockerHub)
  • Copy the artifact into the appropriate folder
  • Install any dependencies the application will need
  • Make any changes to the filesystem, network configuration etc. the application will need
  • Execute the command which runs our application as the last step

Here is how that could look with our Java application:

# Starting from Java 14 base image
FROM adoptopenjdk:14-jre-hotspot

# Making sure stuff is up to date
RUN apt-get update && apt-get upgrade -y

# Installing a dependency
RUN apt-get install -y <some-library-we-need> 

# Copying the "entrypoint" script which contains the command to run our application
ADD docker-entrypoint.sh /var/opt/my-application/docker-entrypoint.sh

# Making it executable
RUN chmod +x /var/opt/my-application/docker-entrypoint.sh

# Adding our artifact too
ADD build/libs/my-application.jar /var/opt/my-application/my-application.jar

# Running from the application directory, so the jar can be found by its relative path
WORKDIR /var/opt/my-application

# Defining our script as our entry point
ENTRYPOINT ["/var/opt/my-application/docker-entrypoint.sh"]

If you are new to Docker or confused about any of the above, follow the Dockerfile reference.

We can then create the ./docker-entrypoint.sh file which will be used to run our application. It should contain as little code as possible, ideally just one command, as everything should have been set up by the Dockerfile.

#!/usr/bin/env bash

echo "Starting my application on port 8080"

exec java -Dserver.port=8080 \
               -DSOME_VARIABLE_I_NEED=${SOME_VALUE} \
               -XX:SomeOtherJvmOptions \
               -jar my-application.jar

This configuration can be tested by building the image

docker build . -t "my-application"

Which will perform every step present in the Dockerfile and create an image called “my-application”.

And then running it as a container with

docker run -p 8080:8080 my-application

Which will invoke the docker-entrypoint.sh script, making sure that port 8080 is forwarded to the host.
If everything went well, the application should be reachable at http://localhost:8080, although it might still not behave correctly or struggle to start, since we haven’t tweaked its configuration to work in Docker yet. We will see how to fix that later; for now, we can just get it as close to working as possible.
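
To quickly check that the container is at least responding, we can hit it from the host (the path here is just an example; use whatever endpoint your application really exposes):

# Basic smoke test from the host; adjust the path to an endpoint your application exposes
curl -i http://localhost:8080/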

As a last step, we want to declare how the image should be built and run in a docker-compose.yaml file, instead of using the two commands above. This is not strictly necessary while we are dealing with just one service, but it will be handy as we create others in the following steps (and docker-compose makes it so much more convenient to deal with multiple services).

version: '3'
services:
  my-application:
    build: .
    ports:
      - "8080:8080"

Thanks to this file, we can just run docker-compose build and docker-compose up and the application will magically be built and run with all the parameters we have defined, instead of passing them all as command line arguments.
We will also need to call docker-compose down if our script gets interrupted in order to tidy up properly.
We can add this to our ./run.sh script.

#!/bin/bash

echo "Building artifact..."

./gradlew clean assemble 

echo "Building image..."
docker-compose build my-application

trap 'docker-compose down' 1 3 15 #Intercepts signals which will stop our scripts, and executes the commands in quotes before exiting

echo "Starting application..."
docker-compose up

echo "Build done"

We can also add a command line argument that allows us to skip rebuilding everything before running, in case we just want to start the application without any change to the source code:

#!/bin/bash

FAST=false


USAGE="
Usage: run [-f]
Options:
  -f    Fast mode. Does not rebuild service.
"

while getopts ":fh" opt; do
  case ${opt} in
    f ) FAST=true
      ;;
    h ) echo "$USAGE"; exit 0
      ;;
    \? ) echo "$USAGE"; exit 1
      ;;
  esac
done

if [ "$FAST" = true ]; then
    echo "Running in fast mode. Not rebuilding artifact or image"
else
    echo "Building artifact..."

    ./gradlew clean assemble 

    echo "Building image..."
    docker-compose build my-application
fi

trap 'docker-compose down' 1 3 15

echo "Starting application..."
docker-compose up

This allows us to invoke our script with

./run.sh -f

if we want to skip rebuilding the application image, and without arguments otherwise.

We have successfully set up Docker around our application. Still, we haven’t changed its configuration at all and its dependencies don’t exist yet, so it will probably not work right away.
We will be fixing that in the following sections.


Creating a new configuration

We need to create another configuration environment for our application, to make it aware that it’s deployed within Docker, and so we can tweak all the settings we need to get it running.
How that looks will depend heavily on the language and framework being used, but in most setups we have sets of key-value pairs that are different for each environment, perhaps stored in different files.
For example, if our Java application uses Spring, we might have an existing application.properties file for deploying our application to production, and we could make an application-docker.properties for our new Docker setup:

spring.application.name=my-application
server.port=8080
base.url=http://localhost:8080
server.ssl.enabled=false

Make sure that you add all parameters relevant to your setup, and that the application is told to use the new profile when run inside Docker.
For our Java application, we can pass an environment variable to the container in the docker-compose.yaml:

version: '3'
services:
  my-application:
    build: .
    ports:
      - "8080:8080"
    environment:
      - ENVIRONMENT=docker

And then activated in Spring via the docker-entrypoint.sh (here we simply hardcode the docker profile, since this script is only used in the Docker setup, but we could equally read it from the ENVIRONMENT variable above):

#!/usr/bin/env bash

echo "Starting my application on port 8080"

exec java -Dserver.port=8080 \
               -Dspring.profiles.active="docker" \
               -DSOME_VARIABLE_I_NEED=${SOME_VALUE} \
               -XX:SomeOtherJvmOptions \
               -jar my-application.jar

Usually a major part of the configuration is the set of addresses at which the application can reach its dependencies in that particular environment, which we have conveniently left out in this section.
We will see how to configure those in the rest of the guide, as each type of external dependency will be reached in a different way.


Mocking 3rd party services

Most applications make use of external persistence or messaging services.
We will look at how to mock them as another Docker container in our setup and have our application configured to talk to the mock instead of the real thing.
In this example we will pretend our Java application needs a MongoDB Atlas cluster to run in production, which we will replace locally with a simple MongoDB container.

The first step is always to look for an official Docker image of the third party service we want to mock, which in most cases will be available on Docker Hub.
We will use the official mongo image, which also conveniently allows us to initialise any data needed in the database by adding JavaScript files to a /docker-entrypoint-initdb.d/ folder.
So we need to set up a Dockerfile that starts from that base image

FROM mongo:latest

COPY init-collections.js /docker-entrypoint-initdb.d/init-collections.js

And some init-collections.js which could contain simple initialization code like

db.createCollection("myCollection");
db.myCollection.insert([
    {
         "_id": "an-id", 
         "value": "something I need in my db for the application to start"
    }
]);

Since our setup is getting a bit crowded now, we can store everything related to this mock under a separate folder, obtaining this structure:

.
├── run.sh
├── Dockerfile
├── docker-compose.yaml
├── docker-entrypoint.sh
├── mocks
│   └── mongo
│       ├── Dockerfile
│       └── init-collections.js
└── src

Note on seeding data: official images for well-known third party services usually offer some way to pre-populate data or a schema, like MongoDB does in the example, and hopefully it can be wired up in the Dockerfile so that the seeding is baked into the image.
However, some official images do not offer such a hook.
We can work around that restriction by adding a custom CMD in the Dockerfile: it should start the service in the background, run your own population script immediately after, and then sleep to keep the container alive, like this:

CMD bash -c "start-service --background=true && /location/my-data-population-script.sh && sleep infinity"

Once we have our Dockerfile ready and a strategy to pre-populate any data or configuration we need, we can add our new mock as a service to the docker-compose.yaml file, specifying that the application depends on it:

version: '3'
services:
  mongo:
    build: mocks/mongo
    restart: always
    ports:
      - "27017:27017"
    environment:
      - MONGO_INITDB_ROOT_USERNAME=user
      - MONGO_INITDB_ROOT_PASSWORD=super-secure-password
      - MONGO_INITDB_DATABASE=my-application-db
  my-application:
    build: .
    ports:
      - "8080:8080"
    depends_on:
      - mongo # This makes sure the mongo container is started before our application (started, not necessarily ready)

Finally we can change our application’s configuration for the Docker environment so that it talks to our mongo container instead of trying to connect to the real MongoDB Atlas cluster over the internet.
We will leverage the networking features of docker-compose, which allow a container to resolve the name of any service declared within the same docker-compose.yaml file.
This means our application container can resolve the name “mongo” to the correct container IP address without any further configuration of the Docker network:

spring.application.name=my-application
server.port=8080
base.url=http://localhost:8080
server.ssl.enabled=false

### External dependencies
mongo.url=mongodb://user:super-secure-password@mongo:27017/admin


We don’t need to change how we run docker-compose in our ./run.sh script, because by default it will start all declared services in the correct order by simply doing docker-compose up, so we can test it immediately.
However, we probably don’t want to rebuild the mock images every time, as they change much less frequently than our application image, and rebuilding them would slow down the script unnecessarily.
So we can add another flag to our ./run.sh script to rebuild the service dependencies only if explicitly asked to do so:

#!/bin/bash

FAST=false
BUILD_DEPENDENCIES=false

USAGE="
Usage: run [-f] [-d] [-h]
Default behavior: rebuild application image, but not dependencies.

Options:
  -d    Rebuild dependencies images.
  -f    Fast mode. Does not rebuild service or dependencies. Will override -d
  -h    Displays this help
"

while getopts ":fhd" opt; do
  case ${opt} in
    f ) FAST=true
      ;;
    d ) BUILD_DEPENDENCIES=true
      ;;
    h ) echo "$USAGE"; exit 0
      ;;
    \? ) echo "$USAGE"; exit 1
      ;;
  esac
done

if [ "$FAST" = true ]; then
    echo "Running in fast mode. Not rebuilding artifact or image"
elif [ "$BUILD_DEPENDENCIES" = true ]; then
  echo "Rebuild dependencies option specified. Will rebuild all images"

  echo "Building artifact..."

  ./gradlew clean assemble 

  docker-compose build # This builds all images
else
    echo "Building artifact..."

    ./gradlew clean assemble 

    echo "Building image..."
    docker-compose build my-application # Builds only application image
fi

trap 'docker-compose down' 1 3 15

echo "Starting application..."
docker-compose up

We will now be able to invoke our script with

./run.sh -d

if we have made any change to the mocks supporting the application, like the seed data for example.
We will just run it without arguments otherwise.


Mocking other team’s custom services

Not all services our application depends on are open source or belong to a well known third party. Sometimes our application’s dependencies live within the same organisation, for example when we need to collaborate with services custom-built by another team or vendor.

This means we need to create our own stub of their API, which can be done in different ways. I usually do it by making a very simple Node.js web server in an index.js file (it has a good balance between simplicity and ease of adding tiny bits of logic if needed).

var http = require('http');

console.log("Mock 3rd party service listening on port 3000");

const stubResponse = {"key" : "value"};

http.createServer((req, res) => {
  console.log("Stub response requested");
  res.writeHead(200, {'Content-Type': 'application/json'});
  res.end(JSON.stringify(stubResponse));
}).listen(3000);

We can include it in a very simple Dockerfile that relies on the base Node.js image:

FROM node:latest

COPY index.js /opt/index.js

CMD ["node", "/opt/index.js"]

Which we can also add under the mocks folder next to our previously created one

.
├── run.sh
├── Dockerfile
├── docker-compose.yaml
├── docker-entrypoint.sh
├── mocks
│   ├── other-team-service
│   │   ├── Dockerfile
│   │   └── index.js
│   └── mongo
│       ├── Dockerfile
│       └── init-collections.js
└── src

And finally add it as a dependency of our application in docker-compose.yaml file

version: '3'
services:
  other-team-service:
    build: mocks/other-team-service
    ports:
      - "3000:3000"
  mongo:
    build: mocks/mongo
    restart: always
    ports:
      - "27017:27017"
    environment:
      - MONGO_INITDB_ROOT_USERNAME=user
      - MONGO_INITDB_ROOT_PASSWORD=super-secure-password
      - MONGO_INITDB_DATABASE=my-application-db
  my-application:
    build: .
    ports:
      - "8080:8080"
    depends_on:
      - mongo
      - other-team-service

And reference it wherever needed in the configuration, once again using Docker’s name resolution features:

spring.application.name=my-application
server.port=8080
base.url=http://localhost:8080
server.ssl.enabled=false

### External dependencies
mongo.url=mongodb://user:super-secure-password@mongo:27017/admin
other-team-service.url=http://other-team-service:3000

Mocking your Cloud Provider

Perhaps the most daunting task of isolating an application from everything around it is mocking cloud provider services which are invoked directly, like functions, file storage, queues, secrets manager etc.
In this example we will focus on how to mock AWS in particular, using a tool called Localstack (which natively works really well with Docker).
Azure and Google Cloud Platform also have their own ways of reproducing their services locally, so it is worth checking their documentation too, although they are out of the scope of this guide.

Localstack will run in a Docker container and pretend to be AWS by mimicking its API, and it is very configurable: we can choose which AWS services we want to enable by passing environment variables, and we can also initialise any configuration or data we need through scripts placed in a /docker-entrypoint-initaws.d/ folder (similarly to the MongoDB container from earlier).

Let’s start by creating a Dockerfile that will use it as a base image and set up our script in the right folder

FROM localstack/localstack:latest

COPY populate-aws.sh /docker-entrypoint-initaws.d/populate-aws.sh

The populate-aws.sh script can contain basic instructions given through the awslocal command (which behaves like the official AWS CLI); for example, we can create some S3 buckets, SSM parameters, SQS queues…

#!/bin/bash

awslocal s3 mb s3://my-bucket
awslocal ssm put-parameter --region="eu-central-1" --name "/name/space/my-secret" --type SecureString --value "SuperSecretParameter!" --overwrite
awslocal sqs create-queue --region="eu-central-1" --queue-name "my-queue"

And they will be initialised as soon as localstack is up.
Again, let’s place these two files in their own folder under mocks

.
├── run.sh
├── Dockerfile
├── docker-compose.yaml
├── docker-entrypoint.sh
├── mocks
│   ├── localstack
│   │   ├── Dockerfile
│   │   └── populate-aws.sh
│   ├── other-team-service
│   │   ├── Dockerfile
│   │   └── index.js
│   └── mongo
│       ├── Dockerfile
│       └── init-collections.js
└── src


Then we need to add the new service to docker-compose.yaml and specify which AWS features we would like it to start, plus a few more settings it needs to work (more info in the Localstack documentation for docker-compose):

version: '3'
services:
  localstack:
    build: mocks/localstack
    ports:
      - "4566-4584:4566-4584"
    environment:
      - DEFAULT_REGION=eu-central-1
      - SERVICES=ssm,s3,sqs
      - HOSTNAME_EXTERNAL=localstack
    volumes:
      - "${TMPDIR:-/tmp/localstack}:/tmp/localstack"
  other-team-service:
    build: mocks/other-team-service
    ports:
      - "3000:3000"
  mongo:
    build: mocks/mongo
    restart: always
    ports:
      - "27017:27017"
    environment:
      - MONGO_INITDB_ROOT_USERNAME=user
      - MONGO_INITDB_ROOT_PASSWORD=super-secure-password
      - MONGO_INITDB_DATABASE=my-application-db
  my-application:
    build: .
    ports:
      - "8080:8080"
    depends_on:
      - mongo
      - other-team-service
      - localstack
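
Once docker-compose up has brought this stack up, we can double-check that Localstack is running and that our seed script ran by querying it from the host with the regular AWS CLI (assuming you have it installed locally; Localstack accepts any dummy credentials):

# Localstack doesn't validate credentials, but the CLI still wants some
export AWS_ACCESS_KEY_ID=test AWS_SECRET_ACCESS_KEY=test AWS_DEFAULT_REGION=eu-central-1

# 4566 is Localstack's default "edge" port, which we published in docker-compose.yaml
aws --endpoint-url=http://localhost:4566 s3 ls
aws --endpoint-url=http://localhost:4566 sqs list-queues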

This should be enough to make Localstack run. But how do we tell our application to use that instead of connecting to the real AWS?
Often a cloud provider is used through its SDK throughout the whole application, so we can’t just tweak a single configuration parameter to make it work, as we might have done with other kinds of mocks that are under our control.

Luckily, the AWS SDK has a way to override the endpoint it contacts to talk to AWS. This feature exists mainly to support non-standard endpoints (corporate proxies, for example), but it can also be used to point to Localstack.
For example, in Java this is how we can do it for SQS:

//...
    @Bean
    public ConnectionFactory connectionFactory(@Value("${aws.local.endpoint:#{null}}") String awsEndpoint) { //AWS endpoint will only be set in docker profile
        LOG.info("Endpoint SQS: " + awsEndpoint);
        AmazonSQSClientBuilder builder = AmazonSQSClientBuilder.standard();
        if (awsEndpoint != null) {
            builder.withEndpointConfiguration(
                    new AwsClientBuilder.EndpointConfiguration(awsEndpoint, "eu-central-1") // Override with localstack endpoint if present
            );
        } else {
            builder.withRegion("eu-central-1");
        }

        builder.withCredentials(awsCredentialsProvider);

        return new SQSConnectionFactory(new ProviderConfiguration(), builder);
    }
//...

With the aws.local.endpoint property specified in the docker properties file:

spring.application.name=my-application
server.port=8080
base.url=http://localhost:8080
server.ssl.enabled=false

### External dependencies
mongo.url=mongodb://user:super-secure-password@mongo:27017/admin
other-team-service.url=http://other-team-service:3000

### Override AWS endpoint with localstack
aws.local.endpoint=http://localstack:4566

The clients for the other AWS services (and for all other languages) all allow changing this configuration, so we can do the same with pretty much any other service we need.
Please refer to the AWS SDK documentation on overriding endpoint configuration for more info.

This should be the only change to application code which is necessary to run this setup.

Code Recap

After following the steps above, the application should now be able to start without issues with ./run.sh and have everything it needs to do its job.
If errors persist, make sure that all necessary variables, data, and stubs have been set up correctly.
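
A few commands that usually help in tracking down what is wrong, assuming the service names used throughout this guide:

docker-compose ps                      # are all the containers actually up?
docker-compose logs -f my-application  # follow the application logs
docker-compose logs mongo              # check whether the seed scripts ran correctly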

In future posts we will see how to use what we have just created not just for manual testing, but also for automated, black-box end-to-end testing of the application.

Below is the code for the full setup:

Structure

.
├── run.sh
├── Dockerfile
├── docker-compose.yaml
├── docker-entrypoint.sh
├── mocks
│   ├── localstack
│   │   ├── Dockerfile
│   │   └── populate-aws.sh
│   ├── other-team-service
│   │   ├── Dockerfile
│   │   └── index.js
│   └── mongo
│       ├── Dockerfile
│       └── init-collections.js
└── src

Root level

run.sh:

#!/bin/bash

FAST=false
BUILD_DEPENDENCIES=false

USAGE="
Usage: run [-f] [-d] [-h]
Default behavior: rebuild application image, but not dependencies.

Options:
  -d    Rebuild dependencies images.
  -f    Fast mode. Does not rebuild service or dependencies. Will override -d
  -h    Displays this help
"

while getopts ":fhd" opt; do
  case ${opt} in
    f ) FAST=true
      ;;
    d ) BUILD_DEPENDENCIES=true
      ;;
    h ) echo "$USAGE"; exit 0
      ;;
    \? ) echo "$USAGE"; exit 1
      ;;
  esac
done

if [ "$FAST" = true ]; then
    echo "Running in fast mode. Not rebuilding artifact or image"
elif [ "$BUILD_DEPENDENCIES" = true ]; then
  echo "Rebuild dependencies option specified. Will rebuild all images"

  echo "Building artifact..."

  ./gradlew clean assemble 

  docker-compose build # This builds all images
else
    echo "Building artifact..."

    ./gradlew clean assemble 

    echo "Building image..."
    docker-compose build my-application # Builds only application image
fi

trap 'docker-compose down' 1 3 15

echo "Starting application..."
docker-compose up

docker-compose.yaml:

version: '3'
services:
  localstack:
    build: mocks/localstack
    ports:
      - "4566-4584:4566-4584"
    environment:
      - DEFAULT_REGION=eu-central-1
      - SERVICES=ssm,s3,sqs
      - HOSTNAME_EXTERNAL=localstack
    volumes:
      - "${TMPDIR:-/tmp/localstack}:/tmp/localstack"
  other-team-service:
    build: mocks/other-team-service
    ports:
      - "3000:3000"
  mongo:
    build: mocks/mongo
    restart: always
    ports:
      - "27017:27017"
    environment:
      - MONGO_INITDB_ROOT_USERNAME=user
      - MONGO_INITDB_ROOT_PASSWORD=super-secure-password
      - MONGO_INITDB_DATABASE=my-application-db
  my-application:
    build: .
    ports:
      - "8080:8080"
    depends_on:
      - mongo
      - other-team-service
      - localstack

Dockerfile:

# Starting from Java 14 base image
FROM adoptopenjdk:14-jre-hotspot

# Making sure stuff is up to date
RUN apt-get update && apt-get upgrade -y

# Installing a dependency
RUN apt-get install -y <some-library-we-need> 

# Copying the "entrypoint" script which contains the command to run our application
ADD docker-entrypoint.sh /var/opt/my-application/docker-entrypoint.sh

# Making it executable
RUN chmod +x /var/opt/my-application/docker-entrypoint.sh

# Adding our artifact too
ADD build/libs/my-application.jar /var/opt/my-application/my-application.jar

# Running from the application folder, so the entrypoint can find the jar
WORKDIR /var/opt/my-application

# Defining our script as our entry point
ENTRYPOINT ["/var/opt/my-application/docker-entrypoint.sh"]

docker-entrypoint.sh:

#!/usr/bin/env bash

echo "Starting my application on port 8080"

exec java -Dserver.port=8080 \
               -Dspring.profiles.active="docker" \
               -DSOME_VARIABLE_I_NEED=${SOME_VALUE} \
               -XX:SomeOtherJvmOptions \
               -jar my-application.jar

build.gradle:

java {
    archivesBaseName = 'my-application'
}

mocks/mongo folder

Dockerfile:

FROM mongo:latest

COPY init-collections.js /docker-entrypoint-initdb.d/init-collections.js

init-collections.js:

db.createCollection("myCollection");
db.myCollection.insert([
    {
         "_id": "an-id", 
         "value": "something I need in my db for the application to start"
    }
]);

mocks/other-team-service folder

Dockerfile:

FROM node:latest

COPY index.js /opt/index.js

CMD ["node", "/opt/index.js"]
var http = require('http');

console.log("Mock 3rd party service listening on port 3000");

const stubResponse = {"key" : "value"};

http.createServer((req, res) => {
  console.log("Stub response requested");
  res.writeHead(200, {'Content-Type': 'application/json'});
  res.end(JSON.stringify(stubResponse));
}).listen(3000);

mocks/localstack folder

Dockerfile:

FROM localstack/localstack:latest

COPY populate-aws.sh /docker-entrypoint-initaws.d/populate-aws.sh

populate-aws.sh:

#!/bin/bash

awslocal s3 mb s3://my-bucket
awslocal ssm put-parameter --region="eu-central-1" --name "/name/space/my-secret" --type SecureString --value "SuperSecretParameter!" --overwrite
awslocal sqs create-queue --region="eu-central-1" --queue-name "my-queue"

Inside the application:

application-docker.properties:

spring.application.name=my-application
server.port=8080
base.url=http://localhost:8080
server.ssl.enabled=false

### External dependencies
mongo.url=mongodb://user:super-secure-password@mongo:27017/admin
other-team-service.url=http://other-team-service:3000

### Override AWS endpoint with localstack
aws.local.endpoint=http://localstack:4566

And the SQS connection factory configuration (Java):

//...
    @Bean
    public ConnectionFactory connectionFactory(@Value("${aws.local.endpoint:#{null}}") String awsEndpoint) { //AWS endpoint will only be set in docker profile
        LOG.info("Endpoint SQS: " + awsEndpoint);
        AmazonSQSClientBuilder builder = AmazonSQSClientBuilder.standard();
        if (awsEndpoint != null) {
            builder.withEndpointConfiguration(
                    new AwsClientBuilder.EndpointConfiguration(awsEndpoint, "eu-central-1") // Override with localstack endpoint if present
            );
        } else {
            builder.withRegion("eu-central-1");
        }

        builder.withCredentials(awsCredentialsProvider);

        return new SQSConnectionFactory(new ProviderConfiguration(), builder);
    }
//...
AWS, Docker

How to attach a remote profiler to a JVM running in EC2 (and maybe Docker)

Part of running big distributed systems at scale is encountering issues which are hard to debug. Memory leaks, sudden crashes, threads hanging… they might all manifest under extreme production conditions, but never in our laptops or test environments.

That’s why sometimes we might need to go straight to the source, and be able to profile a single JVM which is under real production load.

This guide aims to show how we can attach a profiler to a running application when the network, AWS permissions or even a layer of containerisation might be in the way.

We will achieve this by making use of a profiler agent running next to the remote JVM, which will send data to our profiler client. The two will be connected by an SSH tunnel.

Installing the profiler client locally

For the purposes of this guide we will be using JProfiler, although any profiler that works with an agent and a client can be used by following the same principles.

Here is the download link for the JProfiler client: https://www.ej-technologies.com/download/jprofiler/files

Installing and attaching the profiler agent on the running instance (no Docker)

Connect to your running instance with SSH or SSM Session Manager.
The first thing you will need to do is download the profiler agent in it.
For JProfiler, you can run

$ wget -O /tmp/jprofiler_linux_12_0.tar.gz https://download-gcdn.ej-technologies.com/jprofiler/jprofiler_linux_12_0.tar.gz

The tar file then needs to be extracted

$ tar -xzf /tmp/jprofiler_linux_12_0.tar.gz -C /usr/local

And it can be attached to the already running JVM by invoking the agent with

$ /usr/local/jprofiler12.0/bin/jpenable -g -p 1337

Where 1337 is the port I want the profiler agent to use in this example.

Installing and attaching the profiler agent on the running instance (with Docker)

Things get a little more complicated with Docker in the middle, as the profiler agent will need to be installed inside the container, and we need to forward the profiler traffic arriving at the host.

Setting up the port forwarding

Unfortunately, our application container will most likely need to be started with the port mapping for the profiler already in place.
This means that if you cannot afford to restart the container directly on the production instance, your deployment setup will likely have to change.

You can forward the profiler port by adding the option -p 1337:1337 on the docker run command that you use to start your application container, or the equivalent option on your docker-compose file.
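
As a purely illustrative example (your real startup command will have more options), the extra mapping could look like this:

# Forward the application port as before, plus the port the profiler agent will use
$ docker run -d -p 8080:8080 -p 1337:1337 my-application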

Setting up the agent

Now that the traffic will be forwarded from the host to the container, we can work on setting up the agent on the same port. To do so we can connect to the EC2 instance and download the agent into it like in the previous step by running

$ wget -O /tmp/jprofiler_linux_12_0.tar.gz https://download-gcdn.ej-technologies.com/jprofiler/jprofiler_linux_12_0.tar.gz

But then we will need to copy the agent file into the container through the docker cp command

$ docker cp /tmp/jprofiler_linux_12_0.tar.gz <container-name>:/tmp/

Extract it inside the container (as root) with docker exec

$ docker exec -u 0 <container-name> /bin/bash -c 'tar -xzf /tmp/jprofiler_linux_12_0.tar.gz -C /usr/local'

And run it (as the same user owning the JVM)

$ docker exec -u <jvm-owner-user-id> <container-name> /bin/bash -c '/usr/local/jprofiler12.0/bin/jpenable -g -p 1337'

Where 1337 is the port I want the profiler agent to use in this example.

Now we should have a setup where any traffic arriving at port 1337 on the host is forwarded to the profiler agent.

Setting up the SSH Tunnel

The last step is to connect the profiler client and profiler agent over the network.

We can use the SSH local port forwarding feature for this purpose, and create a tunnel between our machine’s port 1337 and the EC2 instance’s port 1337.

We can achieve this by running the following command locally:

$ ssh -i <path-to-your-instance-key> -L 1337:localhost:1337 <user>@<instance-ip-or-dns-name>

No SSH access to the EC2 instance? No problem.

Sometimes we don’t have a key pair for a particular instance, or we don’t have the necessary network configuration to reach it over SSH, or the SSH port simply isn’t open. That’s fine.

In order to bypass that limitation and still obtain a tunnel, we can make use of the ProxyCommand feature of SSH and send our traffic through an SSM command run with the AWS CLI.

This will leverage the AWS API and existing IAM permissions to authenticate us.

$ ssh -L 1337:localhost:1337 <user>@<instance-id> -o ProxyCommand="sh -c \"aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters 'portNumber=%p'\""

Make sure that you have all of the necessary AWS CLI environment variables or profiles for authenticating yourself (and the permissions to issue the command).

More info in the AWS documentation.

Connecting the Client

If everything went well, we should now be able to start the JProfiler client on our local machine.
We need to set up the client to connect to a JVM “on another computer” in the corresponding menu.

But instead of a real remote address we will be connecting to “localhost” (or 127.0.0.1), since the SSH tunnel forwards the traffic for us, and according to our example the profiling port will be 1337.

After you’ve selected all your favourite options, if the client starts showing live profiling data, then it means everything has worked, and you can finally start poking around your heap, threads and whatnot.

Java, Software Architecture

Visualising distance from the main sequence and other Clean Architecture metrics in Java

I must not have been the only one to read “Clean Architecture” by Uncle Bob (Robert Martin) and be immediately sold on the abstractness, instability, coupling and main sequence metrics.
I must not have been the only one to immediately Google for tools to generate them for whichever codebase I happened to be working on at the moment, anxious to see if my refactoring instincts could be backed by a pretty diagram.
And yet, based on the very disappointing (lack of) results, it seems like that might be the case.

There are a few tools for the job yes, but they are clunky to run at best, and they definitely do not produce a visual output to quickly get insights from.

After hours of tinkering and finally getting a result, I thought I would write this guide for others who might have gone through the same experience. Hopefully it will spare the next person some time.

Let’s begin!

Step 1: Installing JDepend

Fortunately, most of the heavy lifting has already been done by the authors of this really nice tool: https://github.com/clarkware/jdepend. It will allow us to generate a report of our package dependencies in an XML format (which we’ll need for our visualization).
From the documentation:

JDepend traverses Java class and source file directories and generates design quality metrics for each Java package. JDepend allows you to automatically measure the quality of a design in terms of its extensibility, reusability, and maintainability to effectively manage and control package dependencies.

From the JDepend README.

Unfortunately, there is not much more information than this in the main README file. Proper installation instructions are a bit buried (you would need to download the repo and then open the HTML doc files in a browser… ugh) and a bit convoluted. I’ll list them here so you don’t have to look for them.

First, we want to make a folder which is going to be our workspace for using JDepend and open it with our terminal.

Then, we want to download the latest major release as a zip file (the url points to the zip in the dist folder of the repo):

$ wget -O jdepend-2.10.zip https://github.com/clarkware/jdepend/blob/master/dist/jdepend-2.10.zip\?raw\=true

Then unzip the file

$ unzip jdepend-2.10.zip

Set the unzipped directory as our $JDEPEND_HOME.

$ export JDEPEND_HOME="$(pwd)/jdepend-2.10"

Finally we will need to change the file permissions

$ chmod -R a+x $JDEPEND_HOME

And add the jar file in the unzipped folder to our classpath:

$ export CLASSPATH=$CLASSPATH:$JDEPEND_HOME/lib/jdepend-2.10.jar

Congrats! JDepend is now ready to be used. (Yes, this was the simplified version).

Step 2: Generating the XML report

On the documentation we see that

JDepend provides a graphical, textual, and XML user interface to visualize Java package metrics, dependencies, and cycles.

However, if we take a look at the graphical interface, we realise that it is a bit… old fashioned.

And it’s very far from the beautiful diagrams we imagined while reading Clean Architecture anyway.

The textual interface also doesn’t help much:

--------------------------------------------------
- Summary:
--------------------------------------------------

Name, Class Count, Abstract Class Count, Ca, Ce, A, I, D, V:
org.springframework.jms.annotation,0,0,1,0,0,0,1,1
org.springframework.jms.config,0,0,1,0,0,0,1,1
org.springframework.security.access.prepost,0,0,4,0,0,0,1,1
org.springframework.stereotype,0,0,56,0,0,0,1,1
org.springframework.test.annotation,0,0,1,0,0,0,1,1
org.springframework.test.context,0,0,3,0,0,0,1,1
...

But fortunately, JDepend also offers the possibility to get the output in XML format, so that we can use it to generate any other kind of visualisation we like.
That is exactly what we are going to do with this command:

$ java jdepend.xmlui.JDepend -file report.xml <path-to-the-root-of-your-java-project>/build

Which should produce a report.xml file. We’ll see how to produce the visualization in the next section.

Step 3: Installing JDepend-UI

jdepend-ui is a little JavaScript-based tool I hacked together to transform the XML report into a somewhat useful HTML page with some insights, which can be navigated more easily than the old JDepend interfaces.

All you need to do to install it is to clone the repo

$ git clone git@github.com:ValentinaServile/jdepend-ui.git

Make sure you have node and npm installed, and run:

$ cd jdepend-ui && npm install

Step 4: Generating the final report

Now that you have jdepend-ui installed, we can use it to generate a much nicer HTML visualization.

We can do that by running the command

npm run jdepend-ui <path-to-xml-report-file> <your-packages-prefix>

Where the path to your XML report should be ../report.xml at this point (if you have followed all the steps in this guide to the letter).

Your package prefix is something like “com.yourcompany.yourservice”. The tool will use it to filter out the metrics that belong to external packages so that you only see the ones which you actually wrote.

If everything went okay, you should now have an index.html file in your working directory.
If you open it with a web browser, you should see something like:

Sample output

On the page you can see the same graph we saw in Clean Architecture, generated from the JDepend metrics. Every “dot” on the screen represents one of your packages. There is a “general” section which shows the average and median distance from the main sequence for the entire codebase.

If you click on any of the dots, package-specific details will appear on the right side of the screen: the package name, its couplings count, its abstractness and instability scores, its distance from the main sequence, and finally a list of which other packages use it and which other packages it uses.

You can also search for a package by name with the search bar.
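
For convenience, here is the whole workflow condensed into a single block, using the same paths and placeholders as above:

# Install JDepend
$ wget -O jdepend-2.10.zip https://github.com/clarkware/jdepend/blob/master/dist/jdepend-2.10.zip\?raw\=true
$ unzip jdepend-2.10.zip
$ export JDEPEND_HOME="$(pwd)/jdepend-2.10"
$ chmod -R a+x $JDEPEND_HOME
$ export CLASSPATH=$CLASSPATH:$JDEPEND_HOME/lib/jdepend-2.10.jar

# Generate the XML report
$ java jdepend.xmlui.JDepend -file report.xml <path-to-the-root-of-your-java-project>/build

# Install jdepend-ui and generate the HTML visualisation
$ git clone git@github.com:ValentinaServile/jdepend-ui.git
$ cd jdepend-ui && npm install
$ npm run jdepend-ui ../report.xml <your-packages-prefix>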

CI/CD, Docker

Solving the Docker in Docker dilemma in your CI Pipeline

There are some tests (integration, end to end, component…) in most modern test suites that rely on some external resources in order to run. Confusing industry terminology aside, their goal is to test the integration between some part of our application and the outside world, like a database or a queue or even some other team’s application.

This is often accomplished by having a Docker container pretend to be the external entity we wish to mock – which is easy enough to set up on a developer’s laptop.

However, things get a bit more tricky once the same test setup has to be run in the team’s CI pipeline.

The reason being, a lot of modern CI/CD tools like Jenkins, GoCD etc. rely on their build agents being Docker containers themselves.

This presents developers who wish to run integration tests in their CI with the task of spawning Docker containers from within other Docker containers, which is not trivial as we are about to see.

Making Docker in Docker work

It used to be impossible to run Docker inside Docker, because some of the low level capabilities Docker needs to run are simply not available in an unprivileged environment like a container.

All of that changed with this update, which introduced the concept of privileged mode, where a container can effectively be run with almost all the capabilities of the host machine.
You can leverage privileged mode to create a container that is able to spawn other containers (particularly, you will want to run your agent image in this way).

You can do so by adding the --privileged flag like this:

$ docker run --privileged <agent-image> <command>

Or, if you are using docker-compose:

version: '3'
services:
  myagent:
    # other things
    privileged: true

You need to make sure that Docker is installed in the agent image you are using, along with all of its dependencies. You can refer to the article above and/or to the Dockerfile of the dind (Docker in Docker) image: https://github.com/jpetazzo/dind/blob/master/Dockerfile.

Running the tests

If you managed to configure your agents in this way, you should be able to run your tests with the same setup as you had in the local environment.

Your container will be simply started by the now privileged agent instead of the host machine, but all of that should be transparent to the application.

Issues

However simple it looks, this approach presents some undesired complications:

  • There is a long list of dependencies to install in order for Docker to work inside the container, some or all of which might not be present in the images you’re working with. Therefore there might be quite a lot of fiddling to do just to get it running.
  • Docker in Docker is still not 100% free of low level, hard to debug issues, as this article sums up.
  • Running containers in privileged mode is dangerous, and opens up the possibility of privilege escalation by malicious users of the CI server. Since our CI servers are also usually deploying infrastructure (and they need the necessary roles and permissions to destroy it), they can be one of the most dangerous systems in our organisation to leave poorly secured.

A better approach: shared docker.sock

For most pipelines, the containers used by tests don’t really need to run inside the agent container.
Most often the requirement is just to be able to start a Docker container from test code running within the agent; the container itself could run anywhere our application can reach for the purpose of the tests.
For this reason, another possible setup for our pipeline is Docker beside Docker, instead of Docker within Docker.

We can achieve this setup by sharing the UNIX socket file used by docker as a volume inside the agent container. In order to understand how this works, let’s first go through a high level refresher of the Docker architecture.

The Docker Architecture

Docker uses three high level components in order to work:

  • Docker daemon: the persistent process that manages containers in the background. Docker uses different binaries for the daemon and the client.
  • Unix socket: docker.sock is the UNIX socket that the Docker daemon listens on. It is the main entry point for the Docker API and what the client sends its commands to. It is located in /var/run.
  • Docker client: the binary that provides the Docker CLI to the user. It communicates with the daemon through the docker.sock.

The same socket can be used by multiple clients, which makes this alternative solution possible.

Sharing the docker.sock

We want the Docker client inside our pipeline agent to talk to the host machine’s Docker socket. This way the agent is able to start “sibling” containers by talking to the daemon running on the host (instead of to its own daemon, like in the previous solution).

We can achieve this by sharing the host’s /var/run/docker.sock file to the agent container as a volume.

The Docker client on the agent would then speak to /var/run/docker.sock as if it was its own local one.

This is the command we would run with the CLI

$ docker run -v /var/run/docker.sock:/var/run/docker.sock  <agent-image> <command>

And the service definition if we are using docker-compose

version: '3'
services:
  myagent:
    # other things
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
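
A quick way to verify the shared socket works: any container started with the socket mounted can talk to the host’s daemon, for example using the official docker CLI image (image name assumed; any image containing a Docker client will do):

# Lists the *host's* containers from inside a container, via the shared socket
$ docker run --rm -v /var/run/docker.sock:/var/run/docker.sock docker:latest docker ps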

Running the tests

Unfortunately, unlike the previous solution, we must now take into account that the container we are talking to is not a child of the current one.
Therefore, any network setup that relies on that being the case will not work anymore.
As an example, locally and in the previous solution we might have been running our tests against localhost:<port> by making use of the port forwarding feature.

However, when the daemon is shared, the same port forwarding will actually refer to the host’s view of “localhost”, not the agent’s.
Therefore we have to rethink our approach and refer to the container by its IP address instead.

You can get a container’s IP address in a number of ways, for example with this command through the Docker CLI:

$ docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' <container>

Or programmatically, for example if you’re using Java and the Testcontainers library:

String containerIp = container.getContainerIpAddress();

The only problem with IP addresses of containers is they are not very reliable and might change, so we might prefer a more stable, non programmatic way to refer to our containers in the code.

Docker happens to offer a DNS resolution service, which assigns stable names to containers that are in the same user-defined network. We can leverage this feature by putting the database container and the agent container on the same network, for example with this connect command:

$  docker network connect <database-network-name> <agent-container-name>

We should then be able to refer to the database container via its container name in our tests, like my-db:<port>. This is especially convenient when using docker-compose as we have the network already sorted out for us, with predictable default container names.
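
A quick, hypothetical way to check that the name resolution works after connecting the networks (assuming ping is available inside the agent image, and using the container names from this example):

$ docker exec <agent-container-name> ping -c 1 my-db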


More details on DNS resolution and containers networking in the official documentation.

Conclusion

There are two main ways to run docker containers from within our CI pipeline agents:

  • as children of our agent container, which might be complicated and open security holes
  • as siblings of our agent container, which is much more straightforward to set up and better from a security standpoint, but presents some overhead for the application to be able to reach the container

The second approach is definitely the most popular, even according to the people who wrote the feature which allows for Docker to run within Docker, but ultimately it is a tradeoff for the developers to consider.

JavaScript, Snippets

The same code with Callbacks vs Promises vs Async/Await

Sometimes in JavaScript we have to deal with different flavours of asynchronous code, so it is handy to be able to map back and forth between them.

Callback

Functions that do something asynchronously are typically implemented using the callback pattern, and their implementation might look like this:

const myAsyncFunction = (parameter, callback) => {
    // ...do something asynchronous here, obtaining a result (or an error)...
    callback(result, error);
};

They may be invoked in the following way:

const main = () => {
    myAsyncFunction("parameter", (result, error) => {
       //use the result or error here
    });
};

However, things may quickly get out of hand if we need the result of an asynchronous function to invoke another asynchronous function, and then we need that to invoke another one and so on…:

const main = () => {
    myAsyncFunction("parameter", (result, error) => {
       myOtherAsynchronousFunction(result, (otherResult, otherError) => {
           myFinalAsynchronousFunction(otherResult, (finalResult, finalError) => {
              //whew!
           });
       });
    });
};

Promise

We can use Promises to solve the indentation mess above. For example, this is how we might change our main function if we want to wrap the myAsyncFunction in a promise

const main = () => {

    const myPromise = new Promise((resolve, reject) => {
        myAsyncFunction("parameter", (result, error) => {
            if (!error) { //or other equivalent check
                resolve(result);
            } else {
                reject(error);
            }
        });
    });

    myPromise
    .then((result) => { /* handle result here */ })
    .catch((error) => { /* handle error here */ })
};

Promises can be chained, so now we don’t have to use nesting in order to use the result of an asynchronous operation:

const main = () => {

   //...

    myPromise
    .then(myOtherPromise)   // assuming myOtherPromise and myFinalPromise are functions
    .then(myFinalPromise)   // that take the previous result and return a new promise
    .catch((error) => { /* handle error here */ })
};

More on Promises and chaining in the official docs.

Async/Await

Once we are already dealing with promises, it is possible to ditch the .then() and .catch() functions from the Promise API and use the async/await syntactic sugar:

const main = async () => {

    const myPromise = new Promise((resolve, reject) => {
        myAsyncFunction("parameter", (result, error) => {
            if (!error) { //or other equivalent check
                resolve(result);
            } else {
                reject(error);
            }
        });
    });

   const result = await myPromise;
};

(Remember to change the main function to be async!)

This way the code can look much more similar to synchronous code

const main = async () => {

   //...
   const result = await myPromise;
   const otherResult = await myOtherPromise;
   
   return result + otherResult;
};

Networking, Snippets, Unix

SSH Multiplexing and Master Mode

When using SSH bastion hosts, it is common to set up new connections throughout the day for many of the use cases discussed in the previous section.
Normally we would start a new TCP connection for each one of them. However, open TCP connections are a finite resource on any machine, and each one of them takes some time to set up.

Multiplexing is a feature provided by SSH which alleviates these problems. It allows a single TCP connection to carry multiple SSH sessions. The TCP connection will be established and kept alive for a specific period of time and new SSH sessions will be established over that connection.

It works by creating a “control socket” file which will be used every time we want to start a new connection.

We need to pass two command line arguments in order to leverage this feature:
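
As a rough sketch with a standard OpenSSH client, the options in question are presumably ControlMaster and ControlPath (with ControlPersist optionally keeping the master connection alive for a while):

# The first connection becomes the "master" and creates the control socket
$ ssh -o ControlMaster=auto -o ControlPath=~/.ssh/cm-%r@%h:%p -o ControlPersist=10m user@bastion-host

# Subsequent connections with the same ControlPath reuse the existing TCP connection
$ ssh -o ControlMaster=auto -o ControlPath=~/.ssh/cm-%r@%h:%p user@bastion-host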

Continue reading
Networking, Snippets, Unix

SSH X11 Forwarding

If you are using SSH between Unix-like operating systems, you can also forward GUI applications over SSH. This is especially useful if your server doesn’t really have a user interface, but you need to check something on the fly with a web browser running on it.

This is possible because Unix-like systems traditionally share a common GUI windowing system called X11, which is what provides the basic framework for the desktop environment: drawing and moving windows on the display device, interacting with a mouse and keyboard, and so on.
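
A minimal sketch of how this is used in practice (assuming an X server is running locally and the remote host has a GUI program such as firefox installed):

# -X enables X11 forwarding for this session (-Y for "trusted" forwarding)
$ ssh -X user@remote-host

# Then, from the remote shell, launch the GUI program: its window renders on the local display
$ firefox &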

Continue reading
Networking, Snippets, Unix

SSH Tunnelling and Port Forwarding

In the previous section we saw how to make use of a jump host as a proxy to run commands into a remote machine.
Sometimes however, having a shell is not necessary, and the connectivity aspect of having a secure channel to the remote host is way more interesting.
SSH’s port forwarding feature allows us to create a secure channel to the remote host, and then use it to carry any type of traffic back and forth.

Local Port Forwarding

Scenario: you want to reach a specific application (which listens on a specific port) on the remote server.

For example, we might want to connect to a PostgreSQL database application that is listening on port 5432 on the remote server, which is in a private subnet.
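
A sketch of one common shape of this, going through a bastion/jump host (host and database names are hypothetical):

# Forward local port 5432, through the bastion, to port 5432 on the database host
$ ssh -L 5432:<database-host>:5432 <user>@<bastion-host>

# In another terminal, connect to the remote database as if it were running locally
$ psql -h localhost -p 5432 -U <db-user> <db-name>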

Continue reading