Cholick.com -

Modern CI/CD Is a Directed Graph of Containers

October 4, 2020 by Matt Cholick

I had quite a difficult time figuring out how exactly GCP's Cloud Build works from reading docs and articles. The marketing material describes its technical functionality poorly; I needed to write code and dive in to figure out how it really behaves. I found the lack of good examples frustrating enough that I decided to write up a post and the code to hopefully save someone else a bit of time. My confusion came from the fact that I thought it was more than it actually is. Cloud Build really just boils down to a triggered chain of containers that are executed with persistent state mounted across steps. It does include a few nice convenience integrations into the larger platform, like auth, but there really is no magic.

My personal stuff has been running on VMs managed by Ansible for quite a few years now. This paradigm made sense when I set it up, but a lot has changed in the intervening years. My playbooks are feature rich and include both blue-green deployment and fully deploying software from a newly provisioned VM. Not working in that space, though, has atrophied the skills I need to maintain a large playbook.

Containers make a lot more sense today anyway, especially with all the layers of sugar that various clouds have built on top of basic orchestration. One of the features I've missed for a while is full automation post-commit: I'd like to be able to make a small edit directly on GitHub and have that change automatically built, tested, and deployed. It's been possible for quite a while. The latest full server rebuild from playbooks I barely remember finally motivated me to invest.

I've been using Concourse for quite a few years, and I have become quite fond of it (once I made peace with its statelessness). For something small, though, it's a bit heavy. I also don't like having the engine I need to restore software from scratch in the same cluster for disaster recovery reasons (my budget for toy projects is a single cluster). There are quite a few of hosted solutions outside of Concourse that would address that need, but starting with GCP's native offering is a pretty low friction choice, especially when the free tier would completely cover my needs.

Enter Cloud Build. I've skimmed the docs and once attended a talk, but I hadn't understood the core of it. I think I must have skipped over the key sentence that summarized the tool: Cloud Build just executes any standard container and does so in a context with some shared state. This model absolutely makes sense, but I had a different impression. Having used Concourse for so many years, as well as briefly testing out Drone, Jenkins X, and CircleCI, this is definitely the paradigm that modern CI/CD systems have settled on. The containers are run sequentially (with parallelization possible), steps return a non-zero exit code to indicate failure, and state is piped to the next container or otherwise stored. That's it; all the modern systems boil down to that and differentiation is just UI and various convenience features.

While trying to make a complete pipeline example work, I found two aspects of the tool confusing. The first of those is Cloud Builders. I came away from docs and examples assuming these were a first class concept, that there is a specific contract between a builder and the system executing it. There isn't; there is no special sauce in the builders. My suggestion is to mostly stick to other containers. For a lot of functionality, there is likely a better maintained and documented container out there.

Images were the second misleading concept for me. The cloudbuild.yaml file can have an images key. Using that, though, doesn't make the image available in GCR until the completion of the pipeline run. That's too late for the pipelines I want to build: I expect a pipeline to unit test, build, and deploy into a cluster. The deploy step doesn't work in this scenario, though, because images aren't available in GCR for a pull by the target cluster. Pipelines have to perform their own push when a pipeline step drive that pull.

I've put the full Cloud Build example pipeline code up on GitHub. The pipeline unit-tests, compiles, lints, builds the image, pushes the image, deploys to a Kubernetes cluster, and tests the deployed workload. The images key in the build yaml doesn't affect how the pipeline works, but it does add a link to the built image in the GCP UI.

unit-test: Runs unit tests using the official Golang image
build-binary: Builds the linux binary, which will be available to subsequent steps. This and the next few steps are executed in parallel using the waitFor value of "-"
helm-lint: Lints the Helm chart
go-lint: Lints the Go code
build-image: Builds the docker image. Subsequent docker commands will have access to the image
push-image: Pushes the image to GCR
install-dev: Installs the software (via Helm) into a cluster controlled by the step's environment variables. This is a one step where a builder did prove useful
prep-e2e: Installs the end-to-end Python tests' pre-requisites. The target flag coordinates with PYTHONPATH in the subsequent step
e2e: Performs end-to-end tests via Python

The full build file follows.


images: [ 'gcr.io/${PROJECT_ID}/cbt:${REVISION_ID}' ]

steps:
- id: unit-test
  name: "golang:1.15"
  env: [ 'GO111MODULE=on' ]
  args: [ 'make', 'test' ]

- id: build-binary
  name: "golang:1.15"
  env: [ 'GO111MODULE=on' ]
  args: [ 'make', 'build-linux' ]
  waitFor: [ '-' ]

- id: helm-lint
  name: 'gcr.io/$PROJECT_ID/helm-builder'
  args: [ 'lint', 'deployment/cbt', '--strict' ]
  waitFor: [ '-' ]
  env: [ 'SKIP_CLUSTER_CONFIG=true' ]

- id: go-lint
  name: "golangci/golangci-lint:v1.31"
  args: [ 'golangci-lint', 'run', './...','--enable', 'gocritic,testpackage' ]
  waitFor: [ '-' ]

- id: build-image
  name: 'docker'
  args: [
      'build', 'deployment/docker',
      '-t', 'gcr.io/$PROJECT_ID/cbt:$REVISION_ID',
      "--label", "org.opencontainers.image.revision=${REVISION_ID}",
  ]

- id: push-image
  name: 'gcr.io/cloud-builders/docker'
  args: [ 'push', 'gcr.io/${PROJECT_ID}/cbt:${REVISION_ID}' ]

- id: install-dev
  name: 'gcr.io/$PROJECT_ID/helm-builder'
  args: [
      'upgrade', 'cbt-dev', 'deployment/cbt', '--install',
      '--wait', '--timeout', '1m',
      '--namespace', 'dev', '--create-namespace',
      '-f', 'deployment/values-staging.yaml',
      '--set', 'image.repository=gcr.io/$PROJECT_ID/cbt',
      '--set', 'image.tag=${REVISION_ID}',
  ]
  env: [
      'CLOUDSDK_COMPUTE_ZONE=us-central1-b',
      'CLOUDSDK_CONTAINER_CLUSTER=hello-cloudbuild'
  ]

- id: prep-e2e
  name: 'python:3.8-slim'
  args: [
      'pip', 'install',
      '--target', '/workspace/lib',
      '--requirement', '/workspace/test/requirements.txt'
  ]

- id: e2e
  name: 'python:3.8-slim'
  args: [
      'python', '-m', 'unittest', 'discover',
      '--start-directory', 'test',
      '--pattern', '*_test.py'
  ]
  env: [ "PYTHONPATH=/workspace/lib" ]

Finally, these are the two references that I found the most useful:

The full build file syntax specification
The documentation that describes substitution and lists all the platform-provided values