Tag: kubernetes

Mitigating Infrastructure Drift by using Software Development Principles – Part 2

Published by Fredrik on July 22, 2021

If you haven’t read the first part on how to mitigate infrastructure drift using software development principles, you should! I will refer to parts of that post during this second, hands-on part.

Work Process

Let’s start by exploring what a simple work process might look like during software development.

I have included post-deployment testing because it is important, but how to orchestrate it within deployments is out of scope for this article.

Let’s add source control, automation and feedback loops. Traditionally there might be multiple different automation tools responsible for the different parts.

Say hi to trunk based development!

When the team grows, the capability of building functionality in parallel grows. This is where the process complexity also grows. The simplest way of doing parallel work is to duplicate the process and let it run side by side. However, it makes little sense to develop functionality within a domain if we cannot use it all together. We need to integrate.

Remember that automation is key for moving fast, and the goal is to deploy to production as fast as possible with as high quality as possible. The build and test processes increase quality while automation enables fast feedback within all of the processes pictured.

Integration introduces a risk of breaking functionality. Therefore we’d like to integrate changes that are as small as possible, as fast as possible, to not lose momentum moving towards production.

Let’s expand trunk based development with a gated check-in strategy using pull requests.

With automation and repeating this process many times a day we now have continuous integration.

A great source control platform that can enable this workflow is for example GitHub.

GitOps

In the previous article I talked about the power of GitOps, where we manage processes and configuration declaratively with source control. By using GitHub Actions we define this process declaratively using YAML and source control it together with the application code. By introducing a simple branching strategy to isolate changes during the parallel and iterative development cycle we also isolate the definition for building and testing.

But why stop there? Git commits are immutable and identified by a checksum of their content, and the history is a directed acyclic graph of all changes. That means that anything stored in Git has strong integrity. What if we could treat each commit as a release? Three in one! Source control, quality checks and release management.

This is called continuous delivery.

Infrastructure as Code

Let’s see if a similar automation process can be used for managing declarative cloud infrastructure.

Almost the same. Deployment has been moved into the process, enabling continuous deployment. Deployment becomes tightly coupled to a branch. Let’s explore this further.

Environments

Traditionally, deployments are decoupled from the development process because there is a limit on the number of environments to deploy to, causing environments to become a bottleneck for delivery. Deploying to the local machine might also be a completely different process, further complicating the development process and making it deviate. It would make sense to have an environment for every development cycle to further expand on the simplicity of trunk based development: one environment for every branch, no matter what the purpose of the environment is.

Using cloud infrastructure we can do that by simply moving the environments to the cloud!

Since each branch represents an environment, it makes working with environments simple.

Need a new environment? Create a branch!
Switch environment? Switch branch!
Delete an environment? Delete the branch!
Upgrade an environment? Merge changes from another branch!
Downgrade an environment? git reset!

Cost

A common concern with infrastructure and environments is cost. Continuously observing how cloud infrastructure related costs change over time becomes even more important when all environments are in the cloud, since more of the cost becomes related to resource utilization. Most cloud providers have tools for tracking and alerting on cost fluctuations, and since all environments are built the same way, the tools can be used the same way for all of them. This also enables observing cost changes faster and acting on them even earlier in the development process.

If development environment costs do become too steep, remember that they usually do not need the same amount of resources as a production environment. For performance related development it might still be relevant, but in all other cases lowering cost is quite easy to achieve by lowering the resource tiers used and using auto-scaling as a built-in strategy. The latter also lowers cost and increases efficiency for production environments by maximizing resource utilization.

In comparison, how much does building and maintaining local infrastructure for each employee cost? How much does it cost to set up a new on-prem environment, local or shared?

Example with Terraform

There are different tools that can help build cloud infrastructure. We’re going to use Terraform and the HashiCorp Configuration Language as the declarative language.

Let’s start by defining how to build, test and deploy. Here’s a simple GitHub Action workflow that automatically builds infrastructure using the previously mentioned workflow:

name: Build Environment

on: push

jobs:
  build:
    name: Build Environment
    runs-on: ubuntu-latest

    env:
      branch: ${{ github.ref }}

    steps:
      - name: Checkout
        uses: actions/checkout@v2

      - name: Creating Environment Variable for Terraform
        run: |
          branch=${{ env.branch }}
          branch_name=${branch#refs/heads/}
          env=${branch_name////-}
          env=${env//_/-}

          cat << EOF > env.auto.tfvars
          env = "$env"
          EOF

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v1

      - name: Terraform Init
        run: terraform init

      - name: Terraform Validate
        run: terraform validate

      - name: Terraform Plan
        id: plan
        run: terraform plan -out=terraform.tfplan -var-file=config.tfvars

      - name: Terraform Plan Status
        if: steps.plan.outcome == 'failure'
        run: exit 1

      - name: Terraform Apply
        run: terraform apply -auto-approve terraform.tfplan

Building could be translated to initializing, validating and planning in Terraform.

When running terraform init, Terraform initializes the working directory by setting up backend state storage and loading referenced modules. Since several people might work against the same environment at the same time, it is a good idea to share the Terraform state for the environment module by setting up remote backend state storage that can be locked in order to guarantee consistency. Each environment should have its own tracked state, which means it needs to be stored and referenced explicitly per environment. This can be done using a partial backend configuration.
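As a minimal sketch of what a partial backend configuration per environment could look like: the helper below writes a small settings file that is passed to terraform init with -backend-config. The function name, the file name and the "<env>.tfstate" key convention are my own assumptions, and it presumes a backend that names its state object with a key setting (azurerm and s3 do).

```shell
#!/bin/sh
# Hypothetical helper: write a per-environment partial backend
# configuration file and print its path. The file name and the
# "<env>.tfstate" key convention are assumptions for illustration.
write_backend_config() {
    env_name=$1
    out_dir=${2:-.}
    config_file="$out_dir/backend.$env_name.tfbackend"
    # "key" names the state object inside the remote backend
    printf 'key = "%s.tfstate"\n' "$env_name" > "$config_file"
    printf '%s\n' "$config_file"
}

# Usage (requires Terraform and a backend block in the configuration):
#   terraform init -backend-config="$(write_backend_config feature-x)"
```

The remaining backend settings (storage account, container and so on) stay in the backend block or in further -backend-config arguments, so only the state key differs between environments.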

To validate that the configuration files are syntactically valid, terraform validate is executed.

Running terraform plan creates an execution plan containing the resources that need to be created, updated or deleted by comparing the current state with the current configuration files and the objects already created. The env.auto.tfvars configuration file created in the second step contains the environment name based on the branch name and can be used to create environment specific resources by naming conventions.
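To make the branch-to-environment naming concrete, here is a small sketch of the same transformation the workflow step performs, written as a portable shell function. The function name is my own; the rules (strip refs/heads/, replace "/" and "_" with "-") are the ones from the workflow above.

```shell
#!/bin/sh
# Derive an environment name from a Git ref by stripping the
# "refs/heads/" prefix and replacing "/" and "_" with "-".
env_name_from_ref() {
    branch_name=${1#refs/heads/}
    printf '%s\n' "$branch_name" | sed 's|[/_]|-|g'
}

env_name_from_ref "refs/heads/feature/my_change"   # prints feature-my-change

# Usage in a workflow step, writing the variable file Terraform loads automatically:
#   printf 'env = "%s"\n' "$(env_name_from_ref "$GITHUB_REF")" > env.auto.tfvars
```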

The last step is to apply (deploy) the execution plan, modifying the targeted resources.

Application Platforms

The infrastructure architecture we have explored so far is quite simple and mostly suited for one-to-one matches between infrastructure and application. This might work well for managed services or serverless but if you need more control you might choose an application platform like Kubernetes.

A service consists of more than a binary. It needs a host, security, network, telemetry, certificates, load balancing etc., which quite quickly increases overhead and complexity. Even though such platforms would fit single application needs, it becomes unnecessarily complex to operate and orchestrate an application platform per application.

Let’s have a look at how Azure Kubernetes Service could be configured to host our applications:

An application platform like Kubernetes works like a second infrastructure layer on top of the cloud infrastructure. It helps simplify operating complex distributed systems and systems architected as microservices, especially when running as a managed service. Kubernetes infrastructure abstractions make it easier for applications to provision and orchestrate functionality while preserving much of their independence. The applications also still control provisioning of infrastructure specific to their needs outside Kubernetes.

Provisioning Kubernetes and all its cloud specific integrations has become separated from application specific infrastructure provisioning. Application infrastructure, on the other hand, has taken a dependency on the Kubernetes API alongside cloud resource APIs. The environment has become tiered and partitioned.

I do again want to emphasize the importance of having autonomous, independent and loosely coupled cross-functional teams. Each team should own their own infrastructure for the same reason they own their applications. A Kubernetes cluster should not become a central infrastructure platform that the whole company depends on.

Separated Application Workflow

Since the Kubernetes infrastructure has been decoupled from the application workflow, it could make sense to move the applications to separate repositories. As they have become autonomous, we need to figure out a simple way to reintegrate the applications downstream. Since branches define the environment of the application platform infrastructure, we could simply go for a similar branch naming standard, i.e. MyEnvironment/SomethingBeingDeveloped.
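A short sketch of how the environment could be read back out of such a branch name, using the same suffix-stripping parameter expansion as the application workflow further down. The function name is hypothetical.

```shell
#!/bin/sh
# Extract the environment part of a ref following the
# "<environment>/<work item>" branch naming standard.
# Everything after the last "/" is dropped; a branch without
# a "/" is returned unchanged.
environment_from_ref() {
    branch_name=${1#refs/heads/}
    printf '%s\n' "${branch_name%/*}"
}

environment_from_ref "refs/heads/MyEnvironment/SomethingBeingDeveloped"   # prints MyEnvironment
```

Note that only the last path segment is removed, so a deeper branch name such as a/b/c would yield a/b as the environment.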

Looking at the Kubernetes platform architecture, the GitOps continuous delivery tool ArgoCD is responsible for deploying applications in the cluster. The process for delivering applications becomes similar to the GitOps process described earlier, where deployment becomes a reversed dependency. Instead of deploying to a specific environment after releasing, ArgoCD is instructed to watch for new releases in an application’s repository, and if a release matches the strategy, it is deployed. This means that many ArgoCD instances can monitor and independently deploy many applications across many k8s clusters without any intervention.

Here is the process again:

We still provision application specific infrastructure, and that works the same way as described earlier, except we now have an upstream dependency: the Kubernetes cluster(s) in the targeted environment. To keep the applications separated in Kubernetes, we use separate namespaces. This is also where we share platform defined details, for example where to find the environment’s container registry. We can do this by creating ConfigMaps.
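As a rough sketch, the platform side could render such a ConfigMap like this. The ConfigMap name (metadata), the metadata.yaml key and the acr.name / acr.resourceGroupName fields are the ones the application workflow below reads; the function name and the example values are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: print a ConfigMap manifest that shares the
# environment's container registry details with one namespace.
render_metadata_configmap() {
    namespace=$1
    acr_name=$2
    acr_resource_group=$3
    cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: metadata
  namespace: $namespace
data:
  metadata.yaml: |
    acr:
      name: $acr_name
      resourceGroupName: $acr_resource_group
EOF
}

# Usage (example values are placeholders):
#   render_metadata_configmap my-application myregistry rg-myenvironment | kubectl apply -f -
```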

Namespaces can also be used to limit resource access for users and service principals, minimizing exposed attack surfaces. Access rights for each application can be defined in the upstream infrastructure repository. Since we use a managed Kubernetes service that integrates with Active Directory, we can manage access to both Kubernetes and cloud infrastructure through managed identities.

Tying it all together with a GitHub Action workflow, it could look something like this:

name: Build and Release Application

on: push

jobs:
  build:
    runs-on: ubuntu-latest

    env:
      branch: ${{ github.ref }}
      version: 1.2.3

    steps:
      - name: Checkout
        uses: actions/checkout@v2

      - name: Extracting Environment From Branch Name
        run: |
          branch=${{ env.branch }}
          branch_name=${branch#refs/heads/}
          env=${branch_name%/*}
          echo "env=$env" >> $GITHUB_ENV

      - name: Login to Azure
        uses: azure/login@v1
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}

      - name: Set AKS Context
        uses: azure/aks-set-context@v1
        with:
          creds: '${{ secrets.AZURE_CREDENTIALS }}'
          cluster-name: my-cluster
          resource-group: rg-${{ env.env }}

      - name: Fetch Environment Metadata
        run: |
          # Quote the go-template so the shell passes it to kubectl as a single argument
          ENVIRONMENT_METADATA=$(kubectl get configmap/metadata -o go-template='{{index .data "metadata.yaml"}}' | docker run -i --rm mikefarah/yq eval -j)
          # jq -r prints raw strings without surrounding quotes
          ACR_NAME=$(echo "$ENVIRONMENT_METADATA" | jq -r .acr.name)
          echo "ACR_NAME=$ACR_NAME" >> "$GITHUB_ENV"
          ACR_RESOURCE_GROUP_NAME=$(echo "$ENVIRONMENT_METADATA" | jq -r .acr.resourceGroupName)
          echo "ACR_RESOURCE_GROUP_NAME=$ACR_RESOURCE_GROUP_NAME" >> "$GITHUB_ENV"

      # Any application specific infrastructure could be applied with Terraform here
      # ...

      - name: Build and Push Application Container Images
        run: |
          az acr build --registry ${{ env.ACR_NAME }} --resource-group ${{ env.ACR_RESOURCE_GROUP_NAME }} --file path/to/Dockerfile --image my-application:latest --image my-application:${{ env.version }} .

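      - name: Update Deployment Strategy (optional)
        run: |
          # read returns non-zero at end of input, so guard it when running under bash -e
          read -r -d '' helmvalues <<- YAML || true
          image_tag: ${{ env.version }}
          YAML

          # env.branch already holds the full ref (refs/heads/...).
          # Write to a temporary file first; reading and writing the same file in one pipeline can truncate it.
          cat application.yaml | \
            docker run -e HELMVALUES="$helmvalues" -i --rm mikefarah/yq eval '(.spec.source.helm.values=strenv(HELMVALUES)) | (.spec.source.targetRevision="${{ env.branch }}")' - \
            > application.updated.yaml
          mv application.updated.yaml application.yaml

          kubectl apply -f application.yaml -n argocd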

What about running and debugging an application during development? Use Bridge to Kubernetes straight from your IDE!

Shared Components

An important strategy when building environments (and applications) is that they need to be autonomous. However, some components might need to be shared or at least moved upstream, like the backend state storage and permissions to create and manage components on the cloud platform.

These components should live in separate repositories with similar development strategies.

Permissions

Defining different roles is good practice in order to align with the principle of least privilege. Depending on the size of the company, people might have multiple roles, but remember that autonomous, cross-functional teams are important to move fast, so each team should have all the roles needed to deliver their applications.

Lessons Learned

One important thing about mitigating drift between environments is to continuously integrate between them. In an ideal world, each change is directly integrated into all environments. However, in reality, that will not always happen, which might cause incompatibility issues when introducing changes to environments. Upgrading from A to B to C is not the same thing as upgrading directly from A to C. That is usually what happens when a branch is merged into another branch, and with cloud infrastructure this might lead to unexpected problems. An example is that no minor versions can be skipped when upgrading Azure Kubernetes Service.

Can I skip multiple AKS versions during cluster upgrade?

When you upgrade a supported AKS cluster, Kubernetes minor versions cannot be skipped.

This means that Terraform needs to apply each commit in order. This can be a bit cumbersome to orchestrate.
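To illustrate the constraint, here is a small sketch that lists the intermediate minor versions an upgrade has to pass through. It is not how Terraform or AKS resolves upgrades; the function name is hypothetical and it assumes both versions share the same major version and are given as major.minor.

```shell
#!/bin/sh
# Print every minor version that must be applied, in order, to get
# from the current version to the target version. Assumes the
# "major.minor" format and an unchanged major version.
minor_upgrade_path() {
    major=${1%%.*}
    current_minor=${1#*.}
    target_minor=${2#*.}
    next=$((current_minor + 1))
    while [ "$next" -le "$target_minor" ]; do
        printf '%s.%s\n' "$major" "$next"
        next=$((next + 1))
    done
}

minor_upgrade_path 1.19 1.22   # prints 1.20, 1.21 and 1.22 on separate lines
```

Each printed version would correspond to one apply, which is the same ordering problem as applying each commit in sequence instead of jumping straight to the latest one.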

Another important aspect of infrastructure as code is that it is quite easy to end up in an inconsistent state if you change configurations manually, which can be tempting in pressing situations when the cloud platform tools are easily available. Don’t fall into that trap.

Conclusions

Working cross-functionally is powerful. It enables teams to work autonomously, end-to-end within a loosely coupled piece of the domain. It also enables teams to use development workflows that include both applications and infrastructure from the start, simplifying how to make them work efficiently together. Using the same infrastructure for all environments and continuously merging changes downstream helps mitigate drift while simplifying the management of infrastructure changes.

Spdy – A .NET client

Published by Fredrik on February 4, 2021

Protocols, more protocols! I love deep-diving into them, and have wanted to explore the secret mysteries of HTTP/2 and gRPC ever since they were released some years ago, but haven’t had any particular reason to do so, until now.

Kubernetes and the Streaming APIs

You have probably used them many times, the streaming functions exposed by the Kubernetes API Server; exec, port-forward, attach, logs etc. They all stream data back and forth between your local host and containers running on a Kubernetes node. They more or less support one data transport layer; Spdy (…and partly WebSocket) (…and sooooon HTTP/2).

Spdy

Spdy is a deprecated Chromium project that has been superseded by HTTP/2. The Spdy protocol specification laid the foundation for the HTTP/2 specification, and therefore they naturally share a lot of concepts.

Spdy was born out of modern web applications’ increasing use of network bandwidth and their need for low latency communication, where HTTP/1.1 (and WebSocket) became bottlenecks.

Web applications today depend on a lot of resources, but HTTP/1.1 only supports one request at a time per connection. This has been somewhat mitigated by browsers creating more connections towards the web servers, which is not very efficient considering the growth rate of consumer IP traffic. At best, this is a workaround which in the end leads to socket starvation problems for the web servers.

Other limitations are one-way-communication initiation (client to server), redundant and uncompressed metadata (headers), downstream throttling and head-of-line blocking.

Spdy <3 .NET

Up until now, there hasn’t been a Spdy implementation available for .NET (as far as I can tell), and since I needed a reliable, duplex communication transfer protocol to integrate my upcoming project, Port, with Kubernetes’ port-forward API, I decided to implement Spdy, a high level client / server implementation of the 3.1 protocol specification!

As with the Kafka Test Framework, Spdy also utilizes System.IO.Pipelines to minimize buffer copying resulting in lower latency and CPU usage. It also comes with an ASP.NET Core middleware (soon) enabling upgrades of HTTP/1.1 requests to Spdy similar to how requests can be upgraded to WebSocket.

For a sneak peek of the Kubernetes port-forward integration using Spdy, head over to Port’s repository and the Spdy port forwarder.

Raspberry PI <3 Kubernetes

Published by Fredrik on December 12, 2019

I have been curious for a while if I could build a powerful Kubernetes cluster that is compact, cheap and easy to use. Then one day I came across BitScope and their blade racks, and I knew I had to get one!

Yepp, that is a 20 node Kubernetes cluster of Raspberry PI 4 model B 4GB.

The Build

I got the 20 unit blade rack which includes two Duo Pi Blade Packs and 10 Duo Pis. It is delivered as a kit that you need to assemble yourself.

Time for assembly!

You also need a couple of Raspberry PIs!

That’s a lot of computers

To handle networking I use a 24 port Netgear GS324T gigabit switch with 20 0.3m flat U/UTP Cat6 RJ45 ethernet cables. I use 32GB SanDisk Ultra micro-SDHC (Class 10 / UHS-I) cards as storage for each PI, which is enough to handle the DDR50 transfer mode supported by the PI. Since the Raspberry PI does not have any cooling and is installed in a quite cramped area, I bought some passive heatsinks and jerry-rigged two silent 200mm case fans from Noctua in order to prevent throttling during load. The blade rack is delivered with four 30mm fans that are attached to the back, but these are very noisy. To power it all I use a 12V20A power supply.

The cost? Around €2000 excl. tax. Not that bad, considering this is a full-blown cluster with a total of 80 CPU cores @ 1.5GHz and 80GB LPDDR4-3200 SDRAM.

There she blows!

Configuration

I’m not going to go into detail about how to set up the PIs and install Kubernetes on them; it is already well explained here, but I will give an overview of the setup. The PIs are running Raspbian Lite, the officially supported operating system from the Raspberry PI Foundation. The setup is quite easy: flash the image and set up some basic configuration. As the Kubernetes distribution, I use k3s from Rancher Labs, a lightweight version of Kubernetes supporting the ARMv7 processor architecture supported by the PI’s Cortex-A72 CPU. It does not come bundled with Docker but with the smaller containerd, which is a graduated project under the Cloud Native Computing Foundation (CNCF).

Fun fact by the way; k8s is a numeronym for Kubernetes, but what k3s means is more of a mystery. I’m gonna go with Kubes!

Lessons Learned

During the installation of Kubes, I stumbled upon a couple of issues that I want to share.

Explicit Version

Running curl -sfL https://get.k3s.io | sh - will download a bootstrap script of k3s which targets the latest version of the k3s image. If the installation for the whole cluster takes a couple of days, you might, like me, end up with a new latest version being released in the meantime, causing incompatibility. I ended up getting the error “tls: failed to find any PEM data in key input” when trying to join one newly installed agent, because of a change in how Go decodes key files which was introduced in a later version.

Always target an explicit version using INSTALL_K3S_VERSION=<version>, i.e. curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=<version> sh -.

iptables

k3s uses iptables-legacy but Raspbian Buster uses iptables 1.8.2 (nft), which sometimes seems to cause problems when adding the agent to the cluster. To solve it, set iptables to legacy by executing sudo update-alternatives --set iptables /usr/sbin/iptables-legacy.

VPN

Trying to resolve local hostnames on the local network in Ubuntu (WSL) when connected to an external VPN service can be a hassle!

Missing Files During Installation

Sometimes the k3s installation process fails to store all files on the SD card in the PI, causing the k3s agent to fail with errors like:
FATA[0000] exec: "k3s-agent": executable file not found in $PATH

The resolution is to remove /var/lib/rancher/k3s/data and re-launch the configuration process.

Shutting Down the Cluster

Just pulling the plug from a Raspberry PI is never a good idea, but shutting down a whole cluster of them one by one is tedious. I use the following script to shut them all down.

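# Collect the hostnames of the server and all clients
hostnames=(<server>)
for ((i=1;i<=<number-of-clients>;i++)); do
    hostnames+=("<client>-$i")
done
# Power off each node over SSH; quoting keeps every hostname a single argument
for hostname in "${hostnames[@]}"
do
    ssh "pi@$hostname" 'sudo poweroff'
done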

Replace <server> with the server’s hostname, <client> with the prefix of the clients’ hostnames and <number-of-clients> with the actual number of clients.

Conclusions

Running Kubernetes using k3s on a couple of Raspberry PIs works quite well, although I have not battle tested the setup yet. Heck, I don’t even know what I will use it for! But it’s pretty, right? 🙂

Scaling in-process migrations with Kubernetes

Published by Fredrik on October 10, 2019

I was recently involved in migrating data from one database to another for an application deployed to Kubernetes. The idea was to use an in-process migration routine that kicks in during the startup process of the application. Resilience and scaling were to be handled entirely by Kubernetes. Any failure would cause the application to fail fast and let k8s restart it.

This turned out to be quite simple to implement!

We use a rolling update strategy in order to let the newly deployed service migrate the data side-by-side with the old one still running the actual application process. This is defined in the application’s deployment manifest:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 50%
    maxSurge: 50%

With a replica count of 8, we will now end up with a minimum of 4 pods running the old version and up to 8 pods running the migration.
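The arithmetic behind those numbers can be sketched as follows. The function name is my own; the rounding follows the Kubernetes rule that a maxUnavailable percentage is rounded down and a maxSurge percentage is rounded up.

```shell
#!/bin/sh
# Compute the pod count bounds during a rolling update from the
# replica count and the maxUnavailable / maxSurge percentages.
rollout_bounds() {
    replicas=$1
    max_unavailable_pct=$2
    max_surge_pct=$3
    # maxUnavailable is rounded down, maxSurge is rounded up
    max_unavailable=$((replicas * max_unavailable_pct / 100))
    max_surge=$(((replicas * max_surge_pct + 99) / 100))
    min_available=$((replicas - max_unavailable))
    max_total=$((replicas + max_surge))
    printf 'min_available=%s max_total=%s max_new=%s\n' \
        "$min_available" "$max_total" "$((max_total - min_available))"
}

rollout_bounds 8 50 50   # prints min_available=4 max_total=12 max_new=8
```

With 8 replicas, at least 4 old pods keep serving while up to 12 pods exist in total, which leaves room for up to 8 new pods running the migration.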

However, for the rolling update strategy to work, we also need a readiness probe for the service to tell the deployment when it is okay to swap out pods. Since the application is a message streaming service hosted in a generic .NET Core host, we could simply use an ExecAction probe that executes cat against a file whose existence we can control during the life cycle of the application. Simple!

The application’s life cycle went from:

private static async Task Main(string[] args)
{
    using var host = new HostBuilder().Build();
    await host.RunAsync();
}

…to something like this:

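using System.IO;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

internal class Program
{
    private static async Task Main(string[] args)
    {
        using var host = new HostBuilder().Build();
        // Startup, including the migration, runs here
        await host.StartAsync();
        // Signal readiness; dispose the returned stream so the file handle is released
        File.Create("/Ready").Dispose();
        await host.WaitForShutdownAsync();
        File.Delete("/Ready");
        await host.StopAsync();
    }
}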

During the startup phase, the migration takes place. At this time, no file called “Ready” exists in the root directory of the container. When start finishes (the migration is done and the application is ready to serve), the Ready file is created. When SIGTERM is received, the Ready file is deleted and we start the shutdown process. At this time, the application is no longer ready to serve.

What about the probe configuration? Easy!

readinessProbe:
  exec:
    command:
    - cat
    - /Ready
  periodSeconds: 2

Every two seconds, the container will be checked for a file called Ready. If it exists, the service is considered ready to serve and the deployment will continue and deploy the next pod according to the update strategy.
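What the probe does boils down to the exit code of cat. Here is a tiny sketch that mimics the check locally; the function name is hypothetical and the path defaults to the /Ready file used above.

```shell
#!/bin/sh
# Mimic the exec readiness probe: succeed (exit code 0) only while
# the marker file exists and can be read.
is_ready() {
    cat "${1:-/Ready}" > /dev/null 2>&1
}

# Usage inside a container:
#   if is_ready; then echo "ready"; else echo "not ready"; fi
```

A zero exit code marks the container as ready; any other exit code, such as the one cat returns for a missing file, marks it as not ready.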

Need more scaling? Just add more replicas!
