Bidding for Builds
This post covers the how of using EC2 spot instances for continuous integration (CI), and the why of wanting to do so.
Really, a whole post on that?
For a CI system to be usable, it must fulfill specific needs:
- Keeping builds (largely!) reproducible
- Providing access control
- Delivering logs
- Allowing builds to be triggered via multiple means
- Accurately allocating compute resources
- Allowing artifact archival
- Allowing arbitrary command execution
Think about these needs for a moment and a well-developed CI system begins to look a lot like a simplistic execution platform.
Such an execution platform was required for an internal Two Six Labs project. Retrofitting CI was a good way to meet that need, and this post covers the details.
The starting point – Gitlab CI
Gitlab addresses all of these CI system needs in one tightly integrated package. Using it allows us to centralize user permissions across both source control and our “simplistic execution platform”. Gitlab CI also provides the ability to trigger jobs via webhooks and to pass environment variables to jobs as arguments. When jobs are triggered, those environment variables are saved, checking the “reproducible builds” box.
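As an illustration of the webhook-trigger path, a pipeline can be kicked off from any script via Gitlab's pipeline trigger API. Here is a minimal sketch; the project ID, trigger token, and variable name are placeholders, not values from our setup:

import requests

resp = requests.post(
    "https://example.com/api/v4/projects/42/trigger/pipeline",
    data={
        "token": "TRIGGER_TOKEN",  # placeholder per-project trigger token
        "ref": "master",           # branch or tag to run against
        # Forwarded to the triggered jobs as an environment variable.
        "variables[INPUT_DATASET]": "s3://bucket/data.tar",
    },
)
resp.raise_for_status()
print(resp.json()["id"])  # ID of the newly created pipeline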
Gitlab CI also supports, to varying degrees, a handful of “executors”. I say “executors” as the specific executor we use, Docker Machine, is more a poorly-supported provisioner than an executor.
Docker Machine
One of Docker Machine’s many features is handling most of the details involved in spinning up and tearing down EC2 Spot instances, such as setting maximum bid price.
Provided the commands you need to run can be executed within a Docker container, Docker Machine can serve as a decent provisioner, providing compute only as you need it.
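As a rough sketch of the sort of call Docker Machine supports, here is what provisioning and tearing down a spot-backed machine looks like when driven from Python via the sh package; the flag values are illustrative, not our production settings:

from sh import docker_machine  # sh resolves docker_machine to the docker-machine binary

docker_machine(
    "create",
    "--driver", "amazonec2",
    "--amazonec2-region", "us-east-2",      # illustrative values only
    "--amazonec2-request-spot-instance",
    "--amazonec2-spot-price", "0.08",       # maximum bid, in dollars per hour
    "ci-spot-test",                         # machine name
)
docker_machine("rm", "-y", "ci-spot-test")  # tear the instance back down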
Making Gitlab work well with Docker Machine
Spot instances are, as you may know, fantastically low-cost. They are low-cost because their availability, both before and after provisioning, is not guaranteed. Spot instance pricing changes with regional and zone demand, and provisioned spot instances can be terminated with only a two-minute warning.
Gitlab CI does not address spot instance termination. If a spot instance running a job is terminated, Gitlab CI eventually marks that job as failed. This is problematic: regardless of what you're using CI for, knowing whether a task failed or completed is too useful and too basic a feature to lose. Our workaround for this issue is the script check_termination.py.
Handling Terminations
check_termination.py is wrapped in a user-data.sh file and passed to the instance as Docker Machine provisions it. user-data.sh configures a cron job to run check_termination.py every minute, allowing it to cancel all jobs if the instance is marked for termination. There's a bit more to this script, so I'm going to go through it function-by-function.
Imports
The interesting thing here is the docker package, a Docker library of unusually high quality; I doubt there is a better one in any language.
#!/usr/bin/env python
import docker
import requests
import os
from sh import wall
from sh import echo
from pathlib import Path
Main
This script checks whether the instance is marked for termination and, if so, terminates the jobs running on it.
def main():
    if to_be_terminated():
        terminate_jobs()

if __name__ == '__main__':
    gitlab_api = "https://example.com/api/v4"
    main()
Check Termination
This function determines if the instance is to be terminated and if that termination needs to be addressed.
One thing to note is that the IP being checked (169.254.169.254, the instance metadata service) is present on all instances on AWS and AWS clones.
def to_be_terminated():
    if not Path("/clean-exit-attempted").is_file():
        try:
            resp = requests.get('http://169.254.169.254/latest/meta-data/spot/termination-time')
            resp.raise_for_status()
            return True
        except:
            return False
    else:
        return False
Wall All
This function is mainly for debugging, using wall to broadcast messages to all ttys.
def wall_all(container, msg):
    wall(echo(msg))
    container.exec_run(f"sh -c \"echo '{msg}' | wall\"")
Terminate Jobs
This function has several roles. It first acquires the ID of the Gitlab CI job running on the instance, then cancels that job so it is not marked as having failed. Lastly, it retries the job, allowing it to complete without user intervention.
It also runs the script /exit-cleanly.sh if it exists, which is useful if your jobs are stateful in a way CI doesn't quite support.
def terminate_jobs():
    client = docker.from_env()
    Path("/clean-exit-attempted").touch()
    for container in client.containers.list(filters = {'status': 'running'}):
        try:
            jid = container.exec_run('sh -c "echo ${CI_JOB_ID?"NOJOB"}"')[1].decode('utf-8').strip('\n')
            pid = container.exec_run('sh -c "echo ${CI_PROJECT_ID?"NOJOB"}"')[1].decode('utf-8').strip('\n')
            if (pid != "NOJOB") and (jid != "NOJOB"):
                job_container = container
                container.exec_run('sh -c "/exit-cleanly.sh"')
        except:
            wall_all(job_container, f"Giving up on clean exit and restarting job {jid} of project {pid}.")
            pass
        kill_url = f"{gitlab_api}/projects/{pid}/jobs/{jid}/cancel"
        retry_url = f"{gitlab_api}/projects/{pid}/jobs/{jid}/retry?scope[]=pending&scope[]=running"
        auth_header = {'PRIVATE-TOKEN': os.environ.get('GITLAB_TOKEN')}
        killed = False
        tries = 20
        while not killed and tries > 0:
            try:
                tries -= 1
                resp = requests.post(kill_url, headers = auth_header)
                print(resp.json()['id'])  # gitlab status code
                killed = True
            except:
                wall_all(job_container, 'Failed to cancel job, retrying.')
                pass
        if killed:
            wall_all(job_container, "Cancellation successful.")
        retried = False
        tries = 20
        while not retried and tries > 0:
            try:
                tries -= 1
                resp = requests.post(retry_url, headers = auth_header)
                print(resp.json()['id'])  # gitlab status code
                retried = True
            except:
                wall_all(job_container, 'Failed to restart job, retrying.')
                pass
        if retried:
            wall_all(job_container, "Restarted job - That's all folks!!!")
User Data
This user-data.sh file wraps check_termination.py. Docker Machine has AWS execute this script on instances once they are provisioned.
#!/bin/bash
cat << "EOF" > /home/ubuntu/check_termination.py
…
EOF
crontab << EOF
GITLAB_TOKEN=gitlabtoken
* * * * * python3.6 /home/ubuntu/check_termination.py
EOF
Zombies?!?!
A problematic bug I had to work around was Docker Machine abandoning provisioned instances, seemingly when it is rate-limited by AWS. The percentage of machines abandoned increases with the number of machines provisioned at once. Fortunately, when this bug occurs, the instance in question is never tagged. As we only use Docker Machine to provision instances for CI, this allowed us to find and terminate the instances meeting that criterion. The script we use is spot_sniper.py.
Spot Sniper
This script terminates abandoned spot instances. While the script is straightforward, it serves a vital purpose: preventing runaway AWS bills. Abandoned instances do not count against the configured resource limits, allowing them to accumulate.
Imports
Some standard imports.
#!/usr/bin/env python3
import boto3
from pprint import pprint
import toml
import os
import syslog
Main
This script looks for and terminates abandoned spot instances. There is a bug somewhere between Docker Machine and Gitlab Runner that causes instances to be abandoned. Instances abandoned by this bug are identifiable by the lack of a Name tag combined with membership in the docker-machine security group. This script also terminates instances abandoned by min_bid_price.py, as the test instances that script provisions are configured to look abandoned.
def main():
    regions = ['us-east-1', 'us-east-2']
    for region in regions:
        ec2 = boto3.resource('ec2', region_name = region,
                             aws_access_key_id=os.environ['AWS_KEY'],
                             aws_secret_access_key=os.environ['AWS_SECRET'])
        all_ci_instances = set(ec2.instances.filter(Filters = [
            {'Name': 'instance.group-name', 'Values': ['docker-machine']},
            {'Name': 'instance-state-name', 'Values': ['running']},
        ]))
        all_functional_ci_instances = set(ec2.instances.filter(Filters = [
            {'Name': 'instance.group-name', 'Values': ['docker-machine']},
            {'Name': 'instance-state-name', 'Values': ['running']},
            {'Name': 'tag-key', 'Values': ['Name']},
        ]))
        # This right here is how the bug somewhere between Docker Machine and Gitlab Runner expresses itself.
        horde = all_ci_instances - all_functional_ci_instances
        if len(horde) == 0:
            syslog.syslog("spot_sniper.py - No abandoned spot instances.")
        for zombie in horde:
            kill_with_fire(zombie)

if __name__ == '__main__':
    main()
Kill With Fire
This function terminates abandoned instances. Its main purpose is to allow zombie instances to be killed with fire, instead of simply being terminated.
def kill_with_fire(zombie):
    syslog.syslog(f"spot_sniper.py - Terminating zombie spot instance {zombie.id}.")
    zombie.terminate()
Crontab
This cron job runs spot_sniper.py every 3 minutes. Its period needs to be tuned to minimize waste without causing excessive rate-limiting.
*/3 * * * * python3.6 /root/gitlab-ci-ami/spot_sniper.py
Optimizing Instances Used
As spot instance prices vary across regions, the zones in those regions, instance types, and time, costs can be minimized by checking across all of those axes. We wrote the script min_bid_price.py to do this.
Min Bid Price
While min_bid_price.py was initially intended to be a cron-run script that selects the cheapest combination of region, zone, and instance type, we also needed to determine instance availability. We found that requesting a few spot instances, waiting a few seconds, and checking whether those instances were available was an effective way to do this.
The following breakdown details the what and why of each component of min_bid_price.py.
Imports
There are a couple of interesting imports in this script:
#!/usr/bin/env python3
from sh import sed
from sh import systemctl
from functools import total_ordering
The sh package wraps binaries found on $PATH, allowing them to be called in as Pythonic a way as is possible without a dedicated library.
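For instance (uptime and grep here are arbitrary binaries, not anything this script uses):

from sh import uptime, grep

print(uptime())  # runs the uptime binary found on $PATH and returns its output
print(grep("-c", "processor", "/proc/cpuinfo"))  # each Python argument becomes one argv entry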
total_ordering is a class decorator that, provided equality and one comparison operator are defined, generates the comparison operators that were not explicitly defined.
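A minimal illustration; the Price class is a made-up stand-in for the instance_profile class below:

from functools import total_ordering

@total_ordering
class Price:
    def __init__(self, amount):
        self.amount = amount

    def __eq__(self, other):
        return self.amount == other.amount

    def __gt__(self, other):
        return self.amount > other.amount

# __lt__, __le__, __ge__, and __ne__ are derived from __eq__ and __gt__,
# which is what lets sorted() order a list of these objects.
assert Price(1) < Price(2)
assert min(Price(3), Price(1)).amount == 1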
Instance Profile Class
instance_profile obtains, stores, and simplifies the sorting of pricing info from AWS.
@total_ordering
class instance_profile:
    def __init__(self, instance, region, zone):
        self.instance = instance
        self.region = region
        self.zone = zone
        self.price = None

    def determine_price(self, client):
        try:
            resp = client.describe_spot_price_history(
                InstanceTypes=[self.instance],
                MaxResults=1,
                ProductDescriptions=['Linux/UNIX (Amazon VPC)'],
                AvailabilityZone=self.region + self.zone)
            self.price = float(resp['SpotPriceHistory'][0]['SpotPrice'])
            return True
        except:
            return False

    def __eq__(self, other):
        return self.price == other.price

    def __gt__(self, other):
        return self.price > other.price

    def __str__(self):
        if self.price is None:
            return f"No price for {self.instance} {self.region}{self.zone}"
        else:
            return f"{self.instance} at {self.region}{self.zone} costing {self.price} at {datetime.datetime.now()}"
Main
There’s a fair amount going on here, so a few interruptions for the following function:
def main():
This code block specifies instances, regions and zones to be considered for use:
instances = ['m5.xlarge', 'm4.xlarge', 'c4.2xlarge', 'c5.2xlarge']
regions = ['us-east-1', 'us-east-2']
zones = ['a', 'b', 'c', 'd', 'e', 'f']
Here I specify the AMI to use in each region, as AMIs are not available across regions:
region_amis = {'us-east-1': 'ami-5bc0cf24', 'us-east-2': 'ami-3de9d358'}
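If you need an AMI in a second region, boto3 can copy it there; a minimal sketch, where the copied image's name is hypothetical and the copy is requested from a client in the destination region:

import boto3

ec2 = boto3.client('ec2', region_name='us-east-2')  # client in the destination region
resp = ec2.copy_image(
    Name='gitlab-ci-build-image',  # hypothetical name for the copy
    SourceImageId='ami-5bc0cf24',  # the us-east-1 AMI above
    SourceRegion='us-east-1',
)
print(resp['ImageId'])  # the new us-east-2 AMI ID, for region_amis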
The following code block specifies the criteria an instance_profile must meet to be usable: it must enable the provisioning of 3 instances, via 3 separate spot instance requests, in under 10 seconds when the maximum bid price is $0.08 per hour:
max_bid = 0.08
test_instances = 3
wait_retries = 2
The following snippet ensures the system configuration can be updated before firing off AWS requests to find a suitable instance_profile and updating the system configuration.
if safe_to_update_config():
    inst_confs = get_price_list(instances, regions, zones)
    for conf in inst_confs:
        if spot_test(conf.region, conf.zone, region_amis[conf.region], conf.instance, test_instances, max_bid, wait_retries):
            if safe_to_update_config():
                syslog.syslog(f"min_bid_price.py - Min Price: {conf}")
                update_config(conf, region_amis)
                break
            else:
                syslog.syslog(f"min_bid_price.py - Cannot update config as jobs are running.")
                break
        else:
            syslog.syslog(f"min_bid_price.py - {conf} failed provisioning check.")
else:
    syslog.syslog(f"min_bid_price.py - Cannot update config as jobs are running.")

if __name__ == '__main__':
    main()
Update Config
The following block of code uses sed, via sh, to edit Gitlab Runner's /etc/gitlab-runner/config.toml configuration file.
If you anticipate needing to use multiple instance types, use the toml package instead of sh and sed here (see the sketch after the function below).
def update_config(next_inst, region_amis):
    sed("-i", f"s/amazonec2-zone=[a-f]/amazonec2-zone={next_inst.zone}/g", "/etc/gitlab-runner/config.toml")
    sed("-i", f"s/amazonec2-ami=ami-[a-z0-9]*/amazonec2-ami={region_amis[next_inst.region]}/g", "/etc/gitlab-runner/config.toml")
    sed("-i", f"s/amazonec2-instance-type=[a-z0-9]*.[a-z0-9]*/amazonec2-instance-type={next_inst.instance}/g", "/etc/gitlab-runner/config.toml")
    systemctl("restart", "gitlab-runner")
    syslog.syslog(f"min_bid_price.py - Moved CI to {next_inst}")
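For what it's worth, here is a hedged sketch of that toml-based alternative. It assumes a single [[runners]] block and that every MachineOptions entry is a key=value string; update_config_toml is a hypothetical name, not a function we ship:

import toml

def update_config_toml(next_inst, region_amis):
    path = "/etc/gitlab-runner/config.toml"
    config = toml.load(path)
    machine = config['runners'][0]['machine']  # assumes a single [[runners]] block
    replacements = {
        'amazonec2-zone': next_inst.zone,
        'amazonec2-ami': region_amis[next_inst.region],
        'amazonec2-instance-type': next_inst.instance,
    }
    # Rewrite each key=value option, substituting the new values where they apply.
    machine['MachineOptions'] = [
        f"{key}={replacements.get(key, value)}"
        for key, value in (opt.split('=', 1) for opt in machine['MachineOptions'])
    ]
    with open(path, 'w') as f:
        toml.dump(config, f)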
Get Price List
The following function creates a boto3 client for each region being considered, and uses those clients to create a price-sorted list of instance_profile objects.
def get_price_list(instances, regions, zones):
    price_list = []
    for region in regions:
        client = boto3.client('ec2', region_name=region,
                              aws_access_key_id=os.environ['AWS_KEY'],
                              aws_secret_access_key=os.environ['AWS_SECRET'])
        for instance_type in instances:
            for zone in zones:
                price = instance_profile(instance_type, region, zone)
                if price.determine_price(client):
                    price_list.append(price)
    price_list.sort()
    return price_list
Safe to Update Config
This function determines whether it is safe to update the system configuration by ensuring that both:
- No CI jobs are running, and
- No non-zombie Docker Machine instances are running
def safe_to_update_config():
    auth_header = {'PRIVATE-TOKEN': os.environ['GITLAB_TOKEN']}
    try:
        resp = requests.get('https://example.com/api/v4/runners/4/jobs?status=running', headers = auth_header)
        resp.raise_for_status()
    except:
        syslog.syslog('min_bid_price.py - Cannot get runner status from example.com. Something up?')
        return False
    if len(resp.json()) != 0:
        return False
    else:
        instances_running = "/root/.docker/machine/machines"
        if os.listdir(instances_running):
            return False
        return True
Spot Test
This function tests instance_profile objects to determine their usability, by checking whether instances can be provisioned quickly enough for the instance_profile in question.
If you're wondering why no instance_profile appears in the signature, the function takes the components of an instance_profile as separate arguments.
def spot_test(region, availability_zone, ami, instance_type, instances, max_bid, wait_retries):
    client = boto3.client('ec2', region_name=region,
                          aws_access_key_id=os.environ['AWS_KEY'],
                          aws_secret_access_key=os.environ['AWS_SECRET'])
    req_ids = spot_up(client, instances, max_bid, ami, availability_zone, region, instance_type)
    usable_config = check_type_in_az(client, wait_retries, req_ids)
    spot_stop(client, req_ids)
    spot_down(client, req_ids)
    if usable_config:
        syslog.syslog(f"min_bid_price.py - {region}{availability_zone} {instance_type} wins as it spins up {instances} instances in {wait_retries*5} seconds at max_bid {max_bid}.")
        return True
    else:
        syslog.syslog(f"min_bid_price.py - {region}{availability_zone} {instance_type} loses as it fails to spin up {instances} instances in {wait_retries*5} seconds at max_bid {max_bid}.")
        return False
Spot Up
This function requests the specified number of spot instances and returns a list of the IDs of those requests.
def spot_up(client, instances, max_bid, ami, availability_zone, region, instance_type):
    responses = []
    for i in range(instances):
        responses.append(client.request_spot_instances(
            LaunchSpecification={
                'ImageId': ami,
                'InstanceType': instance_type,
                'Placement': {
                    'AvailabilityZone': region + availability_zone,
                },
            },
            SpotPrice=str(max_bid),
            Type='one-time',
            InstanceInterruptionBehavior='terminate'))
    return [x["SpotInstanceRequests"][0]["SpotInstanceRequestId"] for x in responses]
Spot Stop
This function cancels outstanding spot instance requests. It can fail when the system is being rate-limited by AWS; requests that are not cancelled will be fulfilled and then cleaned up by spot_sniper.py. This failure is permitted as it results in stderr being emailed via cron, letting us know not to slam the system with jobs for a couple of minutes.
def spot_stop(client, req_ids):
    # Lists, not generators: a generator would be consumed by the first `in` check.
    cancellations = [client.cancel_spot_instance_requests(SpotInstanceRequestIds=[x])["CancelledSpotInstanceRequests"][0]["State"] == "cancelled"
                     for x in req_ids]
    while False in cancellations:
        print("min_bid_price.py - Failed to cancel all spot requests, retrying")
        time.sleep(5)
        cancellations = [client.cancel_spot_instance_requests(SpotInstanceRequestIds=[x])["CancelledSpotInstanceRequests"][0]["State"] == "cancelled"
                         for x in req_ids]
Spot Down
This function terminates provisioned spot instances. It can fail when the system is being rate-limited by AWS, in which case spot_sniper.py will clean up the provisioned instances during its next pass.
def spot_down(client, req_ids):
    instances = [client.describe_spot_instance_requests(SpotInstanceRequestIds=[x]) for x in req_ids]
    terminate_ids = []
    for x in instances:
        try:
            terminate_ids.append(x["SpotInstanceRequests"][0]["InstanceId"])
        except KeyError:
            pass
    if len(terminate_ids) > 0:
        client.terminate_instances(InstanceIds=terminate_ids)
Check Instance Type in AZ
This function checks the status of the spot instance requests every five seconds until the specified number of retries has been made, all requests are fulfilled, or one request is determined to be unfulfillable.
def check_type_in_az(client, wait_retries, req_ids):
    statuses = spot_req_status(client, req_ids)
    while wait_retries > 0 and req_status_check(statuses) is None:
        wait_retries -= 1
        statuses = spot_req_status(client, req_ids)
        time.sleep(5)
    if req_status_check(statuses) is None:
        return False
    else:
        return req_status_check(statuses)
Spot Request Status
This function returns a list of the statuses of spot requests made.
def spot_req_status(client, req_ids):
    return [client.describe_spot_instance_requests(SpotInstanceRequestIds=[x])["SpotInstanceRequests"][0]["Status"] for x in req_ids]
Request Status Check
This function reduces a list of spot request statuses to a boolean once their success can be determined, returning None if their success cannot yet be determined.
def req_status_check(statuses):
    for x in statuses:
        if (x["Code"] == "pending-evaluation") or (x["Code"] == "pending-fulfillment"):
            return None
        elif x["Code"] != "fulfilled":
            syslog.syslog(f"Fail req_status: {x['Code']}")
            return False
        else:
            pass
    return True
Crontab
This crontab runs min_bid_price.py every 10 minutes. The period of this cron needs to be tuned for your use case. If it is too wide, it is less likely that the system configuration will be updated while users are active. If it is too narrow, the cost of determining instance availability increases, as instances are billed by the minute for their first minute.
*/10 * * * * python3.6 /root/gitlab-ci-ami/min_bid_price.py
Config, config, config…
AMI
As bandwidth costs on AWS can add up, and spot instance usage is billed by the second (after the first minute), we pre-load a handful of frequently used Docker images into the AMI that Docker Machine uses to provision instances.
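A minimal sketch of that pre-loading step as we might run it while baking the AMI, assuming the docker package; the image list is illustrative:

import docker

client = docker.from_env()
# Illustrative image list -- substitute the images your jobs actually use.
for repo, tag in [('alpine', '3.8'), ('python', '3.6'), ('ubuntu', '18.04')]:
    client.images.pull(repo, tag=tag)  # baked into the AMI, so jobs skip the pull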
Gitlab Runner Config
This is the config.toml we use for Gitlab Runner. Key points to note are the volume mounts it configures, and the MaxBuilds limit. The volume mounts are configured to allow CI jobs to use volume mounts of their own. Setting MaxBuilds to 1 prevents port conflicts and ensures that every job runs in a clean environment.
concurrent = 80
check_interval = 0

[[runners]]
  name = "alpine"
  limit = 80
  url = "https://example.com/"
  token = "XXXXX"
  executor = "docker+machine"
  output_limit = 16384
  [runners.docker]
    tls_verify = true
    image = "BUILD_IMAGE_TAG"
    privileged = true
    disable_cache = true
    shm_size = 0
    volumes = ["/var/run/docker.sock:/var/run/docker.sock", "/builds:/builds", "/cache:/cache"]
  [runners.cache]
  [runners.machine]
    MachineDriver = "amazonec2"
    MaxBuilds = 1
    MachineName = "gitlab-docker-machine-%s"
    OffPeakIdleCount = 0
    OffPeakIdleTime = 0
    IdleCount = 0
    IdleTime = 0
    MachineOptions = [
      "amazonec2-request-spot-instance=true",
      "amazonec2-spot-price=0.080",
      "amazonec2-access-key=XXXXX",
      "amazonec2-secret-key=XXXXX",
      "amazonec2-ssh-user=ubuntu",
      "amazonec2-region=us-east-2",
      "amazonec2-instance-type=m4.xlarge",
      "amazonec2-ami=ami-XXXXX",
      "amazonec2-root-size=50",
      "amazonec2-zone=a",
      "amazonec2-userdata=/etc/gitlab-runner/user-data.sh",
      "amazonec2-use-ebs-optimized-instance=true",
    ]
Docker Daemon Config
The following is the contents of /etc/docker/daemon.json on all CI machines. It configures the Docker daemon to use Google's mirror of Dockerhub when Dockerhub is down or having reliability issues. It also limits the size of Docker logs (a source of many filled disks).
{ "registry-mirrors": ["https://mirror.gcr.io"], "log-driver": "json-file", "log-opts": {"max-size": "10m", "max-file": "3"} }
Metrics
The following table contains some metrics on the cost of our configuration over the past six months:
Instance Type | Instance Count | Total Job Hours | Cost (USD) | Cost Relative to On-Demand, Always-On (4 Instances) | Turnaround Time Relative to On-Demand, Always-On (4 Instances) |
---|---|---|---|---|---|
On-Demand, Always-On | 4 | 1938.74 | 3423.84 | 100% | 100% |
On-Demand, As Needed | 80 max | 1938.74 | 378.87 | 11.07% | 0.05% |
Spot | 80 max | 1938.74 | 80.69 | 2.36% | 0.05% |
The following histogram shows the durations of jobs run since we started using CI:
The following plot shows the maximum number of jobs we’ve had running at once, over time:
One thing these metrics do not capture is the impact that checking instance_profile availability has on job durations.
Before running this check, job startup times would often reach 6 minutes, and jobs would occasionally end up stuck in a “Pending” state due to a lack of available compute.
Job startup times now rarely exceed 1 minute, and jobs no longer get stuck “Pending”.
Changes since this was started
Docker Machine Gitlab MR
This MR, which was added in Gitlab 11.1, raised the number of CI jobs we could have running at once to at least 80. Given the MR's performance claims, and the number of jobs we could run at once before it was merged, I would guess we can now run somewhere between 200 and 250 jobs at once before being rate-limited by AWS.
https://gitlab.com/gitlab-org/gitlab-runner/merge_requests/909
Meltano
While we were working on this project, Gitlab announced their Meltano project. The goal of Meltano might not be enabling the use of CI to process versioned data, but that will almost certainly be a component of it. As the purpose of this CI configuration was to allow us to use CI to process versioned data, I expect the performance and capabilities of this configuration to improve as bugs in Gitlab Runner and Docker Machine are addressed for Meltano.
https://about.gitlab.com/2018/08/01/hey-data-teams-we-are-working-on-a-tool-just-for-you/
Spot Pricing Change
AWS recently changed how they calculate the price of spot instances, smoothing price fluctuations.
While this change reduces the benefit of hunting for the optimal instance_profile, the approach still allows us to use the cheapest instances that meet our compute, startup-time, and capacity requirements.
https://aws.amazon.com/blogs/compute/new-amazon-ec2-spot-pricing/
So…
That’s left us with more than we need to get our jobs ran in a timely manner.
This paragraph was initially going to be:
There are a few yet-to-be-a-problem cases these scripts have not addressed, such as ignoring sold-out instance-region-zone combinations, automatically restarting jobs that are cancelled due to instance price increases, and automating the periodic generation (and use) of new pre-loaded AMIs.
but, as our needs grew, all those problems had to be addressed.
See https://github.com/twosixlabs/gitlab-ci-scripts for a more copy-paste friendly version of the scripts on this page.