Scaling Battlesnake testing with Kubernetes
November 14, 2021
Note: this article assumes familiarity with Kubernetes-related keywords.
A friend and I are using Battlesnake as a platform for learning new tech skills. In my case, that means continuing to improve my DevOps skillset, and finally taking the plunge to learn Kubernetes.
Battlesnake is a multiplayer version of the classic game Snake, where the game engine sends HTTP requests to your web server. Your response simply needs to be `up`, `down`, `left` or `right` within a deadline of 500ms. How you achieve that, and with which technologies, are decisions entirely up to you.
As you iterate on your snake it becomes essential to test it frequently, both to avoid regressions (such as causing your snake to prefer headbutting snakes stronger than itself) and to check whether your changes have actually made it a stronger contender.
Initially we achieved this by pushing the snake to our production EC2 instance hosted in AWS us-west-2 (as close as possible to the Battlesnake servers, since every millisecond of latency counts towards the 500ms deadline). However, this process was slow, and it became even slower after our move to Graviton instances.
The Battlesnake team provides a command-line version of their rules engine which allows games to be run locally. This means we can test our snake locally, verifying that it handles the basics (do not hit walls, do not hit yourself, do not run out of health, etc.), and see how it performs in a game against another snake (or potentially even against itself).
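For instance, a local round between two variants looks something like this, using the same flags we later use in the cluster (the variant names and local ports here are placeholders):

```sh
./battlesnake play -W 11 -H 11 \
  --name variant-a --url http://localhost:8111 \
  --name variant-b --url http://localhost:8112 \
  -v -t 500
```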
This is still limiting when it comes to testing two snake variants against each other. The CLI can only easily be run sequentially, and if the snakes run on the same computer, they compete for the same resources (CPU, RAM) as the CLI itself.
Ideally we should be able to run multiple instances of the CLI in parallel, along with multiple, load-balanced instances of each snake variant. This is where Kubernetes comes in as a solution. Whilst this can be achieved without Kubernetes (there are options such as HashiCorp Nomad), k8s offers the features we need in a single package whilst also providing a great opportunity for learning.
A Kubernetes-based approach
Kubernetes provides an opportunity to orchestrate our software (packaged as containers) across one or more nodes. Its existing concepts help achieve outcomes such as service discovery, load balancing and parallelisation of jobs.
Here is what our setup looks like, running k3s on a Raspberry Pi-based cluster:
Each snake is containerised and pushed to a Docker registry hosted alongside the cluster.
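Assuming a Dockerfile at the root of the snake's repo, getting a variant into the registry is the usual build-and-push flow (the tag matches the one referenced in the manifests below):

```sh
docker build -t our-registry-url:5000/snake-foobar1:testing-arm .
docker push our-registry-url:5000/snake-foobar1:testing-arm
```

If building from an x86 machine, something like `docker buildx build --platform linux/arm64` would be needed to produce images the Pis can run.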
For each snake, we define a Deployment. This maintains a ReplicaSet underneath, which creates a set of pods running the snake's webserver. We also define a Service which allows each webserver pod to be reachable within the cluster. To allow reaching it from outside the cluster, we define it as type `NodePort` and hardcode a port such as `30000`, in case we wish to `curl` it manually.
Here’s a sample:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foobar1
  namespace: snakepit
  labels:
    app: foobar1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: foobar1
  template:
    metadata:
      labels:
        app: foobar1
    spec:
      containers:
        - name: foobar1
          image: "our-registry-url:5000/snake-foobar1:testing-arm"
          ports:
            - containerPort: 8111
          imagePullPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: foobar1
  name: foobar1-service
  namespace: snakepit
spec:
  selector:
    app: foobar1
  type: NodePort
  ports:
    - protocol: TCP
      port: 8111
      targetPort: 8111
      nodePort: 30000
```
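Deploying it and sanity-checking from outside the cluster then looks like this (assuming the manifest is saved as `foobar1.yaml`; `<node-ip>` is the address of any node in the cluster, and hitting the API root should return the snake's metadata):

```sh
kubectl apply -f foobar1.yaml
kubectl get pods -n snakepit -l app=foobar1
curl http://<node-ip>:30000/
```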
The CLI is also containerised. We define a k8s `Job` where the `command` is specified to include all the CLI args required to point at the snakes deployed within the cluster. We can even refer to the snakes by the names of their services, since k8s resolves DNS for us within the cluster (for example: `http://foobar1-service:8111`). Here's a sample:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: bscli
  namespace: snakepit
spec:
  completions: 4
  completionMode: Indexed
  parallelism: 2
  template:
    spec:
      containers:
        - name: bscli
          image: "our-registry-url:5000/bscli:testing-arm"
          imagePullPolicy: IfNotPresent
          command:
            [
              "./battlesnake",
              "play",
              "-W",
              "11",
              "-H",
              "11",
              "--name",
              "foobar1",
              "--url",
              "http://foobar1-service:8111",
              "--name",
              "pastaz2",
              "--url",
              "http://pastaz2-service:8111",
              "-v",
              "-t",
              "500",
            ]
      restartPolicy: Never
  backoffLimit: 1
```
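Running a testing round is then a matter of applying the Job and watching it (assuming the manifest is saved as `bscli-job.yaml`):

```sh
kubectl apply -f bscli-job.yaml
kubectl get pods -n snakepit -w        # watch the indexed game pods come and go
kubectl logs -n snakepit job/bscli -f  # follow the output of one of the games
```

Note that most of a Job's spec is immutable, so re-running a round means deleting the Job first (`kubectl delete job bscli -n snakepit`) before applying it again.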
By adjusting `completions` and `parallelism` we can decide how large a testing round should be, along with how many games should run at the same time. In an ideal world, `parallelism == completions`; however, my cluster has limited resources, and running too many pods slows down the response times of the snake webservers, making a round of testing take longer overall. By constraining `parallelism`, we can ensure that we do not saturate the cluster's resources.
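Handily, `spec.parallelism` is one of the few Job fields that can be changed while the Job is running, so a round can be throttled up or down on the fly:

```sh
kubectl patch job bscli -n snakepit -p '{"spec":{"parallelism":4}}'
```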
Future work
Our work is not finished yet. Right now, this system does not track the number of wins per snake in a testing round, which is the entire point of testing variants against each other. To achieve this, there are a number of options:
- Tail and grep the logs of the CLI pods (the last line printed by the CLI states the winner).
This option is quick and dirty, but it has some constraints, mainly that we would be doing it after the testing round is complete. If we had to interleave the logs from multiple pods from different testing rounds, we would not be able to tell which result belongs to which round.
- Patch the CLI to write the results to a file and upload it to a webserver, by including a script of some sort which runs after the CLI exits.
For this, I forked and patched the CLI to export the details of the game to a line delimited JSON file. I also opened a PR which has started the RFC process around this functionality.
The main issue here is that I would have to maintain a fork of the CLI and would need to run the script after the game ends, potentially complicating how the container is built and how the container's command is currently overridden.
- Divert the CLI pod's logs to a sidecar container within the same Pod, which then uploads the result.
This would be the most “k8s native” way of reaching the goal. There is no need to patch the CLI (at least for our current needs, which boil down to answering the question “did the snake win?”, something easily parsed from the current CLI's output), whilst the script which uploads the result lives in a separate container within the Pod, keeping the CLI's container clean and simple. A rough sketch of this option follows below.
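Here is a minimal sketch of that third option, teeing the CLI's output onto a shared `emptyDir` volume and POSTing the final line to a hypothetical `results-collector` service (the container layout, paths and collector URL are all placeholders, and it assumes the CLI image ships a shell):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: bscli-sidecar
  namespace: snakepit
spec:
  template:
    spec:
      volumes:
        - name: results
          emptyDir: {}
      containers:
        - name: bscli
          image: "our-registry-url:5000/bscli:testing-arm"
          volumeMounts:
            - name: results
              mountPath: /results
          # Tee the CLI's output into the shared volume, then drop a marker
          # file so the sidecar knows the game is over.
          command:
            - "/bin/sh"
            - "-c"
            - >-
              ./battlesnake play -W 11 -H 11
              --name foobar1 --url http://foobar1-service:8111
              --name pastaz2 --url http://pastaz2-service:8111
              -v -t 500 2>&1 | tee /results/game.log;
              touch /results/done
        - name: uploader
          image: "curlimages/curl:latest"
          volumeMounts:
            - name: results
              mountPath: /results
          # Wait for the marker, then upload the last log line (the winner).
          command:
            - "/bin/sh"
            - "-c"
            - >-
              while [ ! -f /results/done ]; do sleep 1; done;
              tail -n 1 /results/game.log |
              curl -s -X POST --data-binary @- http://results-collector:8080/result
      restartPolicy: Never
```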
Once a solution is picked and implemented, I'll be back with another blog post. In the meantime, feel free to check out our progress on the bantersnake team page!
Sidenote: the diagrams have been illustrated by me on a reMarkable 2. It's a great piece of hardware, but my lack of decent handwriting is pretty clear. I am hoping this will improve over time.