
Developing a Serverless Machine Learning Product as a DevOps Engineer


Creating a Machine Learning Product

After looking through about ten different “trending technology buzzwords for 2018” articles, DevOps was nowhere to be seen. It seems we have been ingrained into the ecosystem and are now just a simple fact of day-to-day software engineering. However, there was one special phrase that kept showing up no matter how click-baity the article looked: “Machine Learning,” with its ever-piercing gaze on businesses trying to find any way to integrate it into their architecture. I figured that since DevOps professionals are no longer the coveted cash cow, it was time to learn some new skills. After getting a base understanding of machine learning through Google’s crash course, it was time to take what I learned and create my own product.

After some brainstorming, I came up with the idea of creating an “AI” tweet bot that learns from a user’s tweets and then generates its own.

Data

Normally, when I start to create a pipeline for developers, I want them to be able to get their code up first so we can run it through SonarQube, run a Coverity report, and print out all sorts of cute metrics. When it comes to machine learning, data is king. Data is the code. If there’s a bug in the data, everything else falls apart. Datasets are the heart of any ML product, and I wanted to get this right so I could begin my continuous integration journey on the right foot. In my studies, all the datasets were nice and tidy, optimized to be used right away with training models. In the real world, not so much. Luckily for my tweet bot, the only data I need is, well, tweets. After going through the grueling Twitter API signup process, I was ready to scrape some tweets for my AI bot.

require 'twitter'

# Authenticate against the Twitter REST API (fill in your app credentials)
client = Twitter::REST::Client.new do |config|
  config.consumer_key    = ""
  config.consumer_secret = ""
  config.bearer_token    = ""
end

# Usage: ruby tweet_text.rb <twitter_handle> <output_file>
handle      = ARGV[0]
file_output = ARGV[1]

puts "Grabbing all tweets for user #{handle}"

# Page backwards through the timeline with max_id until no tweets remain
def collect_with_max_id(collection = [], max_id = nil, &block)
  response = yield(max_id)
  collection += response
  response.empty? ? collection.flatten : collect_with_max_id(collection, response.last.id - 1, &block)
end

def client.get_all_tweets(user, tweets)
  collect_with_max_id do |max_id|
    options = { count: 200, include_rts: true }
    options[:max_id] = max_id unless max_id.nil?
    user_timeline(user, options).each do |tweet|
      tweets.push(tweet.text)
    end
  end
end

tweets = []
client.get_all_tweets(handle, tweets)

# Write one tweet per line to the output file
File.open(file_output, 'w') do |file|
  tweets.each do |tweet|
    file.puts tweet
  end
end

Before continuing on, I ask myself: is this repeatable? Can it be automated? Luckily, the script is self-contained and takes simple arguments: a Twitter handle to grab and an output file location. If I was feeling fancy, I could have even uploaded the output to S3 with versioning turned on to keep different versions of the tweets, which would only take a couple of AWS CLI calls. But let’s not get ahead of ourselves. Let’s build out our pipeline first now that we have the heart of our application.
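For reference, that optional S3 step is a small sketch like this; the handle, file name, and bucket name are just placeholders for illustration.

# Run the scraper, then push the raw tweets to a versioned S3 bucket
ruby tweet_text.rb darpa darpa_tweets.txt

# One-time setup: turn on versioning so every scrape becomes a new object version
aws s3api put-bucket-versioning --bucket my-tweet-data --versioning-configuration Status=Enabled

# Each upload to the same key now creates a new version instead of overwriting
aws s3 cp darpa_tweets.txt s3://my-tweet-data/darpa/darpa_tweets.txt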

Training

The next step in any DevOps pipeline after getting the code is, well, building the code. However, there will be no WAR files, Node scripts, or Docker images here (yet).

The next step will be to train a model. My crash course taught me the basics of how to create a linear classifier and a neural network with some fun interactive notebooks. However, my use case is a bit more complex, and throwing data into a predefined model in TensorFlow isn’t going to cut it this time. Luckily, GitHub has a trove of projects to model from, and, what do you know, a simple-to-use project offering “Multi-layer Recurrent Neural Networks (LSTM, RNN) for character-level language models in Python using Tensorflow” is right on GitHub (https://github.com/sherjilozair/char-rnn-tensorflow).

The class didn’t teach me that much, but the README gives a clear, concise way to start training the model, and all it needs is a text file, which we conveniently just created. After setting up the directory, it was time to begin the training.
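Here’s a minimal sketch of that setup and kickoff, assuming the repo’s default data layout of data/<name>/input.txt and the tweet file from earlier:

# char-rnn-tensorflow expects the training text at data/<name>/input.txt
mkdir -p data/darpa
cp darpa_tweets.txt data/darpa/input.txt

# Kick off training; checkpoints are written to the save/ directory by default
python train.py --data_dir=./data/darpa/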

(screenshot: train.py working through the training run)

Use the model

Ok, that took some time, but now we’ve got a model, so what’s next? Well, just like in a CI/CD pipeline after the product is built, we need to verify it works.
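Checking the model is just another script call; a minimal sketch, assuming the default save/ checkpoint directory that train.py wrote to:

# Generate roughly a tweet's worth of characters from the latest checkpoint
python sample.py --save_dir=save -n 280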

(screenshots: running sample.py and the generated darpa_tweet output)

Ok, that totally looks like a human tweeted that. Well, it looks like this AI bot won’t be considered a deepfake anytime soon. But it seemed to capture the essence of the character structure and even created a link (not sure where that goes…), and we did create an end-to-end model pipeline for collecting data, training the model, and making an inference. And since everything runs through scripts, it’s pretty easy to chain them together into a push-button modeling pipeline.

Retrain

Well, now that we’ve got the training pipeline working, what do we do when we want to retrain our model or try a different Twitter handle? Just like any good CloudFormation script or Jenkins build, we have model parameters, known as hyperparameters! These special inputs allow the model to train differently and produce different results. Unlike some classification models, it’s hard to “score” this model, because it just outputs text instead of verifying “hotdog” or “not hotdog”. It will take lots of fine-tuning to find a model we want. Nevertheless, we’ll try it for the sake of repeatability.

[banjtheman@ubuntu]char-rnn-tensorflow(master|!?)> python train.py --data_dir=./data/darpa/ --seq_length 64 --rnn_size 120 --input_keep_prob 0.8 --output_keep_prob 0.5

Hmm. Not much better, it seems. Optimizing neural networks has been described as a “dark art”, and with a limited dataset of about 5,000 tweets, our model may not get much better. Nevertheless, the principles of creating a machine learning product and providing a repeatable framework are captured in my experiment.

Productize

It’s nice that we have a great CLI tool, but my experience tells me we need a RESTful API endpoint for our machine learning app in order for it to be useful. Since there is only one “call”, we can easily make a Flask app that gets output from our trained model and exposes it at an endpoint.

from flask import Flask
from flask import jsonify
from flask_cors import CORS
import commands  # Python 2 only; use subprocess.getstatusoutput on Python 3

app = Flask(__name__)
CORS(app)


@app.route("/")
def tweet():
    # Run the sampling script and return its output as JSON
    status, text = commands.getstatusoutput('python sample.py')
    data = {}
    data["text"] = text
    return jsonify(data)


if __name__ == "__main__":
    app.run(host='0.0.0.0', port=8080)

This simple Flask app runs our model’s sampling code and returns its output as a JSON object on each call.

[banjtheman@ubuntu]char-rnn-tensorflow(master|!?)> python app.py 
 * Debug mode: off 
 * Running on http://0.0.0.0:8080/ (Press CTRL+C to quit) 
[banjtheman@ubuntu]char-rnn-tensorflow(master|!?)> curl localhost:8080 
127.0.0.1 - - [24/Aug/2018 21:32:00] "GET / HTTP/1.1" 200 - 
{"text":" Chellenges Program sover of the gohcheric of theralmer we're the SHONY from http://t.co/YxKdSGjOu3nThis tentesips to hals sigated dancort vancodes stopyreath iamallaces vods of whis RAR https://t.co/XyMtq0oJgK by #Loaltartalter-#aipt https:"}

Awesome, now we have a server that can give us back an inference from our trained model.

Dockerize

We have walked through a lot of steps, but it’s finally time to throw everything into the coveted Docker container so we can rapidly deploy our model to any platform we choose. Since our model was trained with TensorFlow, it’s easy to start with the TensorFlow base image and stuff in all our content.

FROM tensorflow/tensorflow

# Add the web server dependencies on top of the TensorFlow base image
RUN pip install flask
RUN pip install flask_cors

# Copy the trained model, sampling code, and Flask app into the image
RUN mkdir -p /home/twosix/
COPY char-rnn-tensorflow /home/twosix/tweet-bot

EXPOSE 8080
WORKDIR /home/twosix/tweet-bot
CMD python /home/twosix/tweet-bot/app.py

Now to build and deploy:

docker build -t tweet-bot:v-1 . 
docker run -d -p 8080:8080 tweet-bot:v-1 
curl localhost:8080 
{"text":" too for s! S@Drime Fund. http://t.co/DjAaHZLEx2,u2026 https://t.co/AHBXBmUXA8 #GartwnThe Senfor MAPS PSETAM Arday Anrantion Connalr Hether, banes panse jekto offace buolozutions, Gorene Don vinguan provican bread to sew. DARPA ulortact From DAR"}

Sweet, now we have a trained, dockerized model with a RESTful API endpoint.

Serverless Deployment

Ok, we have completed the “does it work on my machine” test quite a bit now, but can we scale this image up? We can leverage AWS Fargate to run serverless Docker containers so we can easily scale and deploy our trained model.

In order to leverage Fargate, we need to push our image to Amazon Elastic Container Registry (ECR), which allows us to store Docker images.

# get-login prints a docker login command; wrap it in $() to actually run it
$(aws ecr get-login --no-include-email --region us-east-1)
docker build -t tweet-bot .
docker tag tweet-bot:latest MY_URL.us-east-1.amazonaws.com/tweet-bot:latest
docker push MY_URL.us-east-1.amazonaws.com/tweet-bot:latest

Once our image has been pushed, we can create a Fargate deployment for it. First, we need a JSON task definition file for our image.

{
    "requiresCompatibilities": [
        "FARGATE"
    ],
    "containerDefinitions": [
        {
            "name": "tweet-bot",
            "image": "IDNUM.dkr.ecr.us-east-1.amazonaws.com/tweet-bot",
            "essential": true,
            "portMappings": [
                {
                    "containerPort": 8080,
                    "protocol": "tcp"
                }
            ]
        }
    ],
    "networkMode": "awsvpc",
    "memory": "2048",
    "cpu": "1024",
    "executionRoleArn": "arn:aws:iam::IDNUM:role/ecsTaskExecutionRole",
    "family": "tweet-bot",
    "taskRoleArn": "arn:aws:iam::IDNUM:role/ecsTaskExecutionRole"
}

Next, we can register the task definition, create our service, and then hit the endpoint.

#register the task definition and launch the service 
[banjtheman@ubuntu]aws> aws ecs register-task-definition --cli-input-json file://fargate-task.json 
[banjtheman@ubuntu]aws> aws ecs create-service --cluster default --service-name tweet-bot --task-definition tweet-bot:1 --desired-count 1 --launch-type "FARGATE" --network-configuration "awsvpcConfiguration={subnets=[subnet-abcd1234],securityGroups=[sg-abcd1234],assignPublicIp='ENABLED'}" 

#get the public ip of our service 
[banjtheman@ubuntu]aws> aws ecs describe-services --services tweet-bot | grep tasks 
                    "message": "(service tweet-bot) has started 1 tasks: (task 5a0e69c1-462c-4206-8cc7-c9e08d861f27).", 
[banjtheman@ubuntu]aws> aws ecs describe-tasks --tasks 5a0e69c1-462c-4206-8cc7-c9e08d861f27 | grep eni 
                            "value": "eni-d4470051" 
[banjtheman@ubuntu]aws> aws ec2 describe-network-interfaces --network-interface-ids eni-d4470051 | grep PublicIp 
                        "PublicIp": "34.201.11.93" 
#test our bot 
[banjtheman@ubuntu]aws> curl 34.201.11.93:8080 
{"text":" The Our glen gisess at: Where..nFity you take of USSE/Rafc ##dopios, year melegle for his ore strigaweries an myen natune of #AorgatnNow _RPSC to diven Iffan: (dislast jademotth shares sherting for #starren spriad to cave: submecnime. http"} 

#clean up 
[banjtheman@ubuntu]aws> aws ecs stop-task --task 5a0e69c1-462c-4206-8cc7-c9e08d861f27

Nice, we just deployed a serverless, trained machine learning model with a RESTful API endpoint.

Push Button Solution

We were able to go from collecting data to training a model, dockerizing the model, and then deploying a serverless image that can seamlessly scale. We can break our path down into four steps: data collection, model training, dockerizing, and deploying. With the following bash script, we can run our entire parameterized CI/CD pipeline, which will get our data, train our model, create a self-contained image, and deploy it to a serverless architecture.

#!/bin/bash

# Usage: ./pipeline.sh <twitter_handle> <version> "<extra train args>" <scale>
HANDLE=$1
VERSION=$2
TRAIN_ARGS=$3
SCALE=$4

echo "Running DevOps Pipeline"

echo "Collecting data for $HANDLE"
mkdir -p ./data/$HANDLE
ruby tweet_text.rb $HANDLE ./data/$HANDLE/input.txt

echo "Starting training..."
python train.py --data_dir=./data/$HANDLE/ $TRAIN_ARGS
echo "Training finished"

echo "Starting docker build"
docker build -t ${HANDLE}_tweet-bot:$VERSION .
docker tag ${HANDLE}_tweet-bot:$VERSION MY_URL.us-east-1.amazonaws.com/${HANDLE}_tweet-bot:$VERSION
docker push MY_URL.us-east-1.amazonaws.com/${HANDLE}_tweet-bot:$VERSION
echo "Docker image pushed to ECR"

echo "Deploying $SCALE images"
# Point the task definition at the freshly pushed image (fargate-task.json holds an IMAGE_PLACE_HOLDER), then register and deploy it
sed -i "s|IMAGE_PLACE_HOLDER|MY_URL.us-east-1.amazonaws.com/${HANDLE}_tweet-bot:$VERSION|" fargate-task.json
aws ecs register-task-definition --cli-input-json file://fargate-task.json
aws ecs create-service --cluster default --service-name tweet-service --task-definition tweet-bot --desired-count $SCALE --launch-type "FARGATE" --network-configuration "awsvpcConfiguration={subnets=[subnet-abcd],securityGroups=[sg-abcd],assignPublicIp='ENABLED'}"

Did we DevOps?

Time to evaluate whether we brought some DevOps into building a machine learning product. Here are a couple of key DevOps principles I adhere to when working on any project.

Automated Builds

We want to be able to easily define how we build, version, and refine our application each time we deploy a new version, in an automated fashion. Our setup allows for a parameterized push-button pipeline: the data collection step takes a Twitter handle to pull tweets from, the training step lets us edit hyperparameters on the fly, and our Docker builds are easily versioned by image name and tag. It’s sufficient to say we have a very configurable setup, but we still have to manually kick off the script. For our use case we might want to retrain the model whenever our target user sends a new tweet, which would require setting up some kind of polling to check for new tweets and then kick off the pipeline, as sketched below.
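A crude sketch of that polling loop, reusing the tweet scraper and assuming the pipeline script above is saved as pipeline.sh (the file names and interval are illustrative, not part of the pipeline itself):

#!/bin/bash
# Re-scrape on a loop and rerun the pipeline whenever the tweet file changes
HANDLE=$1

touch last_seen_tweets.txt  # baseline for the first comparison
while true; do
  ruby tweet_text.rb $HANDLE latest_tweets.txt
  if ! cmp -s latest_tweets.txt last_seen_tweets.txt; then
    echo "New tweets detected for $HANDLE, kicking off the pipeline"
    ./pipeline.sh $HANDLE v-2 "" 1
    cp latest_tweets.txt last_seen_tweets.txt
  fi
  sleep 900  # check every 15 minutes
done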

Scalable Solutions

When we want to deploy code, worrying about servers and provisioning always haunts the mind of any site reliability engineer tasked with keeping services running. Luckily for us, our serverless architecture with AWS Fargate hides the servers away, so we don’t have to worry about instance types, managing cluster scheduling, or optimizing cluster utilization. We just need to focus on our Docker image. Score!!

Product-focused

DevOps at its heart is about modularizing distinct parts and putting them into a turnkey solution that can be deployed at scale. Our process involved grabbing code from a GitHub repo, getting raw tweets from users on Twitter, and using Docker to package a trained model bundled with a web server into an easy-to-use API that scales easily thanks to Fargate. We turned a Python CLI tool into a repeatable, scalable, dockerized RESTful API that can be deployed to the public cloud with a few lines of code.

Takeaway

Like blockchain, the other trending 2018 buzzword, there is much to capture in the machine learning space, and while some ML products are far along, most projects are still in their infancy. Experiencing the process and getting your feet wet is the best way to learn. DevOps is a framework, and it can be applied to machine learning just like any other type of software engineering: building a training foundation, optimizing model creation, and providing an elegant integration process for getting operational inference results.

… and when an AI does eventually manage your AWS environment, I’ll be glad that I took the leap in getting DevOps completely automated.