Running CockroachDB in Production with Nomad
Those who know me know I have been a huge fan of HashiCorp Nomad for the past year or so. I have also been open about my view that the tool's sweet spot is not scheduling containers; that ship has sailed in the very fashionable direction of Kubernetes. Instead, I have been using Nomad in a similar vein to how others supervise their services with tooling like systemd.
I needed to deploy CockroachDB to support a number of applications. This could have been achieved with a configuration management tool and systemd, but it seemed like a very good use case for Nomad: it can schedule applications across a number of hosts, it will replace an instance that has died, and it supports rolling updates to help keep things highly available. So what was the worst that could happen?
This post assumes that you already have a Nomad cluster running. If you need help with that, have a read through the Nomad documentation.
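If you just want to experiment locally first, a single development agent (server and client in one process) is enough to follow along, though it is not something to run in production:

nomad agent -dev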
A Nomad job file can be written in either HCL or JSON and is broken into a number of sections. The hierarchy of a job looks as follows:
- job
  - group
    - task
    - task
  - group
The TL;DR is that a job is the globally unique unit of work you submit to Nomad. A job contains one or more groups, where a group defines a set of tasks that must be co-located on the same Nomad client, and a task is an individual unit of work, e.g. running a service or a container.
job "cockroach-db-cluster" {
...
}
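Putting those pieces together, the skeleton of the job described in the rest of this post looks like this:

job "cockroach-db-cluster" {
  group "db-cluster" {
    task "cockroach-cluster" {
      ...
    }
  }
}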
Job
The only parts of my Nomad job that sit outside the group are as follows:
datacenters = ["dc1"]
type = "service"
update {
health_check = "task_states"
max_parallel = 1
stagger = "12s"
}
group "db-cluster" {
...
}
This specifies that the job is a service that will run in the dc1 datacenter. It also describes the update strategy for the job: Nomad will update a single allocation at a time, an allocation is considered healthy (via task_states) when all of its tasks are running, and allocation changes are staggered by 12 seconds, slightly less than the default of 30 seconds.
Group
A group is where you set constraints and the number of allocations for a set of tasks. My group is made up of the following:
ephemeral_disk {
  sticky = true
}

count = 3

constraint {
  distinct_hosts = true
}

task "cockroach-cluster" {
  ...
}
As this is a database, I would rather not wait for Nomad to copy data around the cluster when it needs to reschedule an allocation. Setting the ephemeral_disk stanza with sticky = true means Nomad will make a best-effort attempt to place the replacement allocation on the same machine, so the local/ and alloc/data directories only need to be moved to the new allocation's folder on the same disk.
We then tell Nomad that we want three allocations of the group and constrain them so that no two run on the same host. This spreads the database across the cluster and helps keep it highly available.
Task
The task is where all the detail of how to actually run CockroachDB lives. This is a fairly complex example because it is linked to Vault to get certificates for a TLS-secured CockroachDB cluster. I will try to make it clear which parts are optional and which are not.
driver = "raw_exec"
artifact {
source = "http://mybucket.s3-us-west-2.amazonaws.com/cockroach-linux-amd64.tar.gz"
}
vault {
policies = ["nomad-pki"]
change_mode = "signal"
change_signal = "SIGUSR1"
}
template {
data = <<EOH
{{ with secret "mypki/issue/crdb" "ttl=720h" "common_name=node" "ip_sans=192.168.0.1,192.168.0.2,192.168.0.3,127.0.0.1" "alt_names=db-cluster.service.consul" "format=pem" }}
{{.Data | toJSON }}
{{ end }}
EOH
destination = "local/bundle.json"
change_mode = "signal"
change_signal = "SIGHUP"
splay = "10m"
}
template {
left_delimiter = "(("
right_delimiter = "))"
data = <<EOH
{{- printf "%s\n" (datasource "bundle").private_key -}}
EOH
destination = "local/node.key.tmpl"
perms = "600"
change_mode = "noop"
}
template {
left_delimiter = "(("
right_delimiter = "))"
data = <<EOH
{{- printf "%s\n" (datasource "bundle").certificate -}}
EOH
destination = "local/node.crt.tmpl"
perms = "644"
change_mode = "noop"
}
template {
left_delimiter = "(("
right_delimiter = "))"
data = <<EOH
{{- range $index, $value := (datasource "bundle").ca_chain -}}
{{- printf "%s\n" $value -}}
{{- end -}}
EOH
destination = "local/ca.crt.tmpl"
perms = "644"
change_mode = "noop"
}
config {
command = "gomplate"
args = [
"-d",
"bundle=file://${NOMAD_TASK_DIR}/bundle.json?type=application/json",
"-f",
"local/ca.crt.tmpl",
"-o",
"local/ca.crt",
"-f",
"local/node.crt.tmpl",
"-o",
"local/node.crt",
"-f",
"local/node.key.tmpl",
"-o",
"local/node.key",
"--",
"${NOMAD_TASK_DIR}/cockroach",
"start",
"--certs-dir=${NOMAD_TASK_DIR}",
"--join=192.168.0.1:26257,192.168.0.2:26257,192.168.0.3:26257",
"--cache=.25",
"--max-sql-memory=.25",
"--store=${NOMAD_TASK_DIR}/data,size=90%",
"--logtostderr=INFO",
]
}
service {
name = "${TASKGROUP}"
}
Firstly, I am specifying a driver of raw_exec. This driver is disabled by default in Nomad, so you will need to enable it before this approach will work. The reason I chose raw_exec is that I needed to wrap my command with gomplate to generate the correct certificate files, and gomplate was available on the system PATH, so the task needed to be able to call it directly.
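As a sketch, enabling raw_exec on each client looks roughly like the following in the Nomad client configuration (this is the older client options form; newer Nomad releases enable it through a plugin "raw_exec" block instead):

client {
  options = {
    "driver.raw_exec.enable" = "1"
  }
}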
Next there is an artifact stanza. This tells Nomad where to download the CockroachDB binary from.
Next, I have a vault stanza. This gives the task a Vault token with the nomad-pki policy attached, which is what the templates below use to request the certificates needed to run the cluster with TLS. If you don't need to run CockroachDB with TLS, you can omit this part.
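For reference, the nomad-pki policy attached to that token needs to allow issuing certificates from the PKI role used in the templates. A minimal sketch of such a Vault policy, assuming the same mypki/issue/crdb path, might look like:

# hypothetical contents of the "nomad-pki" policy
path "mypki/issue/crdb" {
  capabilities = ["create", "update"]
}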
The template stanzas let me acquire a certificate bundle from my PKI backend and then split it into its separate parts, because CockroachDB requires a distinct certificate, CA and key; at the time of writing it does not support being passed a single certificate bundle. Again, if you don't need to run CockroachDB with TLS, you can omit this part.
Next is the config stanza. This is the configuration passed to the task driver so that the task can be run. In this case, because I needed to put the correct certificates in the correct place, I execute the gomplate command first. The -- at the end of the args means that, once the templates are rendered, gomplate hands off to the cockroach binary. Notice that I had to give the full path to that binary, because the command being executed (gomplate) was found on the PATH rather than in the task directory.
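Once Nomad has interpolated ${NOMAD_TASK_DIR}, the task is roughly equivalent to running the following: gomplate renders the three certificate files from the bundle and then runs cockroach start:

gomplate \
  -d "bundle=file://${NOMAD_TASK_DIR}/bundle.json?type=application/json" \
  -f local/ca.crt.tmpl -o local/ca.crt \
  -f local/node.crt.tmpl -o local/node.crt \
  -f local/node.key.tmpl -o local/node.key \
  -- "${NOMAD_TASK_DIR}/cockroach" start \
     --certs-dir="${NOMAD_TASK_DIR}" \
     --join=192.168.0.1:26257,192.168.0.2:26257,192.168.0.3:26257 \
     --cache=.25 --max-sql-memory=.25 \
     --store="${NOMAD_TASK_DIR}/data,size=90%" \
     --logtostderr=INFO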
Lastly, I have a service stanza. This instructs Nomad to register a service in HashiCorp Consul, using the ${TASKGROUP} variable as the name; in this case, that value would be “db-cluster”.
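Once the allocations are running, you should be able to find the nodes through Consul. For example, assuming Consul's DNS interface is listening on its default port of 8600:

dig @127.0.0.1 -p 8600 db-cluster.service.consul SRV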
If you wanted to run the same task without talking to Vault for TLS certificates, it would look as follows:
driver = "exec"
artifact {
source = "http://mybucket.s3-us-west-2.amazonaws.com/cockroach-linux-amd64.tar.gz"
}
config {
command = "cockroach"
args = [
"start",
"–-insecure",
"--join=192.168.0.1:26257,192.168.0.2:26257,192.168.0.3:26257",
"--cache=.25",
"--max-sql-memory=.25",
"--store=${NOMAD_TASK_DIR}/data,size=90%",
"--logtostderr=INFO",
]
}
service {
name = "${TASKGROUP}"
}
This version has far fewer moving parts, and because gomplate is no longer needed from the host, the driver can move back to exec.
Overall, the entire template looks as follows:
job "cockroach-db-cluster" {
datacenters = ["dc1"]
type = "service"
update {
health_check = "task_states"
max_parallel = 1
stagger = "12s"
}
group "db-cluster" {
ephemeral_disk {
sticky = true
}
count = 3
constraint {
distinct_hosts = true
}
task "cockroach-cluster" {
driver = "raw_exec"
artifact {
source = "http://mybucket.s3-us-west-2.amazonaws.com/cockroach-linux-amd64.tar.gz"
}
vault {
policies = ["nomad-pki"]
change_mode = "signal"
change_signal = "SIGUSR1"
}
template {
data = <<EOH
{{ with secret "mypki/issue/crdb" "ttl=720h" "common_name=node" "ip_sans=192.168.0.1,192.168.0.2,192.168.0.3,127.0.0.1" "alt_names=db-cluster.service.consul" "format=pem" }}
{{.Data | toJSON }}
{{ end }}
EOH
destination = "local/bundle.json"
change_mode = "signal"
change_signal = "SIGHUP"
splay = "10m"
}
template {
left_delimiter = "(("
right_delimiter = "))"
data = <<EOH
{{- printf "%s\n" (datasource "bundle").private_key -}}
EOH
destination = "local/node.key.tmpl"
perms = "600"
change_mode = "noop"
}
template {
left_delimiter = "(("
right_delimiter = "))"
data = <<EOH
{{- printf "%s\n" (datasource "bundle").certificate -}}
EOH
destination = "local/node.crt.tmpl"
perms = "644"
change_mode = "noop"
}
template {
left_delimiter = "(("
right_delimiter = "))"
data = <<EOH
{{- range $index, $value := (datasource "bundle").ca_chain -}}
{{- printf "%s\n" $value -}}
{{- end -}}
EOH
destination = "local/ca.crt.tmpl"
perms = "644"
change_mode = "noop"
}
config {
command = "gomplate"
args = [
"-d",
"bundle=file://${NOMAD_TASK_DIR}/bundle.json?type=application/json",
"-f",
"local/ca.crt.tmpl",
"-o",
"local/ca.crt",
"-f",
"local/node.crt.tmpl",
"-o",
"local/node.crt",
"-f",
"local/node.key.tmpl",
"-o",
"local/node.key",
"--",
"${NOMAD_TASK_DIR}/cockroach",
"start",
"--certs-dir=${NOMAD_TASK_DIR}",
"--join=192.168.0.1:26257,192.168.0.2:26257,192.168.0.3:26257",
"--cache=.25",
"--max-sql-memory=.25",
"--store=${NOMAD_TASK_DIR}/data,size=90%",
"--logtostderr=INFO",
]
}
service {
name = "${TASKGROUP}"
}
}
}
}
You can then follow HashiCorp's documentation on running a job to submit it, and Nomad will schedule the correct number of instances of the database cluster.
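Assuming the job above is saved as cockroach-db-cluster.nomad, submitting it and checking on its allocations boils down to:

nomad run cockroach-db-cluster.nomad
nomad status cockroach-db-cluster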