Time Series Machine Learning with Excel Data – ML Tutorial

(>45 Minute Read)

Time series forecasting with Machine Learning

Time series forecasting can be beneficial in countless applications, from predicting supply chain demand and machine downtime to customer conversions and much more. This tutorial assumes you’re a complete beginner with the concepts presented below, and it aims to not skip a single step in getting your DeepAR algorithm up and running.

I was asked to develop a better forecasting system within the supply chain at my current place of employment. I did this with AWS’s SageMaker, the DeepAR algorithm, and Excel data we had been keeping for years. This post serves as a step-by-step tutorial for implementing your own DeepAR model to better predict future data, using my project as a backbone.

What We’ll Cover in This Post

  1. How to begin using AWS’s services to implement Machine Learning. This will include data storage with S3 and machine learning with SageMaker.
  2. SageMaker uses Jupyter Notebooks, so we’ll need to cover this briefly as well. 
  3. How to load your Excel Spreadsheets into SageMaker to begin dissecting their data and predicting future data. We’ll do this with Pandas.
  4. Training your model and setting your hyperparameters, and what those mean.
  5. Evaluating your model’s performance. 
  6. Accurately predicting future values.

This post should include everything you need to know. So buckle up and let’s get started. 


AWS SageMaker and S3 

Amazon Web Services (AWS) offers a service for just about anything you can imagine. We’ll only be using a fraction of what’s available in this tutorial, but if you’ve never used their services before, now is a great time to start. AWS is my go-to for some project types and my fallback for others. If you have no idea where to start, AWS likely has a starting-point solution.

Simple Storage Service (S3)

To begin, we’ll need to use AWS’s file storage service, S3. This will hold all the files that we want our machine learning model to be trained on. We’ll train the model with datasets, in this case an Excel spreadsheet, that we first upload to S3.

If you’re familiar with S3 feel free to skip this section. If not, let’s go through creating an S3 bucket. 

Creating an S3 Bucket for our time series data.

  • Go ahead and log in to the AWS Console (create an account if you do not currently have one). AWS does have a free tier you can use; however, some of the features we use in this tutorial are paid. I spent maybe $2.50 on the whole process, though.
    The Amazon Console can be found here.
  • Navigate to AWS’s Storage section of services and select S3.

Then we’re going to create a bucket for the service SageMaker. 

  1. Click the create button
  2. Name your bucket something with ‘sagemaker’ at the beginning, like the picture below shows. Bucket names have to be unique, so you won’t be able to name yours exactly like mine, but any name works as long as ‘sagemaker’ is at the beginning.
  3. Click through the steps, keeping all the default settings. (If you’d rather script this part, see the sketch just below the screenshot.)
Sagemaker Bucket Creation
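
If you’d rather not click through the console, the same thing can be done programmatically. Here’s a minimal sketch using boto3; the bucket name and local file name are placeholders for your own, and the S3 key follows the folder layout used later in this tutorial.

import boto3

s3 = boto3.client('s3')

#Placeholder name, bucket names must be globally unique and start with 'sagemaker'
bucket_name = 'sagemaker-your-unique-bucket-name'

#In us-east-1 you can omit CreateBucketConfiguration; other regions require it
s3.create_bucket(Bucket=bucket_name)

#Upload the spreadsheet we'll train on (local file name is a placeholder)
s3.upload_file('TransposedConsumptionEdit.xlsx', bucket_name, 'sagemaker/data/train/TransposedConsumptionEdit.xlsx')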

AWS SageMaker 

SageMaker is Amazon’s machine learning service. There are several different algorithms you can use, and I’ll point out where to swap them in when we get to that step in the code. However, for this tutorial we’ll be working with their DeepAR algorithm.


Their DeepAR algorithm trains on the entire dataset rather than just one time series at a time. This is unique and pretty cool, especially in the application I needed machine learning for. It gives a good picture of the entire set of data when the time series depend on each other – part consumption in a factory is dependent on all the other parts (you only have so many employees, hours, etc.).

  1. Go ahead and create a Notebook instance. It doesn’t really matter what you name your notebook instance.
  2. Start your instance once you create it – it may take a minute for the instance to become operational.
  3. Then, open your notebook instance. 

This notebook is a Jupyter notebook. If you’ve never worked with a Jupyter notebook before, you’re missing out. It’s essentially an environment to store, execute, and annotate code in an easily sharable way. We’ll do a very brief tutorial on it below, but I encourage you to get comfortable with Jupyter!


Jupyter Notebooks

Jupyter Notebooks are what AWS’s machine learning services operate out of, which is super convenient for our purposes.
 
Once you’ve opened your notebook, go to the top right corner and select the New dropdown.
 
From here, we’ll be using Python 3 for this tutorial, so go ahead and select the conda Python 3 option.
 
This will allow us to write our own extensions of the DeepAR algorithm while also being able to drop in Amazon’s code from their tutorials.
 
If you’re interested in their tutorials, you can find them on GitHub, and they’re pretty helpful if you want to get deeper into machine learning.
 
You can find all those tutorials here.
Conda Python Instance of Jupyter Notebook

Let’s get coding & training 

I’ve been told before that most of machine learning is feature engineering, which is pretty much just figuring out what data is valuable to you and getting it into a usable form.

A common stumbling point when implementing machine learning tends to be putting the data into a form that you can actually use to train your model.

I’ll include my sample data here. Feel free to use it throughout the tutorial, or construct data of your own to the same layout as the photo below. 

If you want to follow along with this tutorial exactly using your own data, please make sure of a few things.

The first column needs to contain your index values, or the timestamps, that your time series will be indexed against. These can be in days, weeks, hours, or months. I couldn’t quite get this to work with months that start on the first day of the month rather than end on the last, so if you’re going to use months, go ahead and modify your dates to month end, either within Excel or programmatically (see the short sketch after these layout notes).

Secondly, the headings of each column can be the name of the time series. In my example, these heading names are material numbers. 

Lastly, the values under these headings need to be numerical values that represent some variable for your heading. For me, these are the amounts of material consumed during each month.
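
Here’s a minimal sketch of that month-end fix in pandas, assuming a hypothetical local copy of the spreadsheet; pd.offsets.MonthEnd(0) snaps each date forward to the end of its month and leaves dates already at month end alone.

import pandas as pd

#Hypothetical local copy of the spreadsheet; the first column holds the dates
df = pd.read_excel('TransposedConsumptionEdit.xlsx', index_col=0, parse_dates=True)

#Snap every date to the last day of its month (dates already at month end are unchanged)
df.index = df.index + pd.offsets.MonthEnd(0)

df.to_excel('TransposedConsumptionEdit_month_end.xlsx')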

Code in the Jupyter Notebook

Now for the fun part: the code of our machine learning script. This code will go into your notebook instance in the form of an executable Jupyter notebook.

If you’re just here for the code I’m using, as always, I post all of it on my GitHub account. You can find this example here.

Loading the Requirements into your Jupyter Notebook
%matplotlib inline
from __future__ import print_function

import sys
from urllib.request import urlretrieve
import zipfile
from dateutil.parser import parse
import json
from random import shuffle
import random
import datetime
import os

#Install s3fs into the notebook environment before importing it
!conda install -y s3fs

import boto3
import s3fs
import sagemaker
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
from ipywidgets import IntSlider, FloatSlider, Checkbox
Using a Random Seed – allows replication of results
np.random.seed(1)
Final package imports
import boto3
import s3fs
import sagemaker
from sagemaker import get_execution_role
Specifying bucket location

The code below specifies the location of your bucket. The full path is built on the s3_data_path line using .format() with your bucket, folder, and file names.

We’ll also need to start a SageMaker session at this step.

bucket = 'sagemaker-us-east-1-074654881237'
prefix = 'prysmian-forecasting' #Used for denoting storage location

sagemaker_session = sagemaker.Session()
role = get_execution_role()

folder1 = 'sagemaker' #First directory within bucket
folder2 = 'data' #Second directory within bucket
folder3 = 'train' #Third directory within bucket
data_key = 'TransposedConsumptionEdit.xlsx' #File name


s3_data_path = 's3://{}/{}/{}/{}/{}'.format(bucket, folder1, folder2, folder3, data_key)
s3_output_path = '{}/{}/output'.format(bucket, prefix)

Here we’ll specify the container image we’re using. I want to use AWS’s DeepAR algorithm, so I specify that here with image_name.

from sagemaker.amazon.amazon_estimator import get_image_uri
image_name = get_image_uri(boto3.Session().region_name, 'forecasting-deepar')

Next we’ll set the frequency of our values and the prediction settings. My data is in monthly increments (a granularity of one month) – change this to whatever granularity your own spreadsheet uses.

freq = 'M'
prediction_length = 5 #How many months into the future we want to predict
context_length = 12 #How much historical context we supply to the algorithm – a year is more than enough

Next we want to read in our data from the Excel spreadsheet. We want the code below to be dynamic with respect to the total number of time series in the sheet. Since Amazon’s DeepAR algorithm considers the entire set, we’ll need to be sure to include the entire set.

data = pd.read_excel(s3_data_path, parse_dates=True, index_col=0) #Sets the index to the first column of the spreadsheet (the dates)
num_timeseries = data.shape[1] #Number of columns = number of time series
data_length = data.index.size #Dynamic length of each time series (changes and gets smarter month to month)
print("This is the number of time series you're running through the algorithm (this many materials):")
print(num_timeseries)
print("This is the # of months of data that your spreadsheet has:")
print(data_length)
t0 = data.index[0]
print("This is the beginning date:")
print(t0)
time_series = [] #Holds the time series data

for i in range(num_timeseries):
    index = pd.date_range(start=t0, freq=freq, periods=data_length) #Must treat this dynamically, so a for loop (length changes monthly)
    time_series.append(pd.Series(data=data.iloc[:, i], index=index))

print(time_series[2]) #Visual check that the data loaded correctly. No graph = no bueno amigo
time_series[2].plot()
plt.show()

Now we want to create the structure for our training data and append each series into it.

time_series_training = [] #Creates the structure of our training data
for ts in time_series:
    time_series_training.append(ts[:-prediction_length]) #Appends everything except the prediction length, default of five months because that's what we're predicting

The step below is just a visual check that everything is being loaded the way we want it to be.

time_series[2].plot(label='test')
time_series_training[2].plot(label='train', ls=':') #Fancy way of showing what we want the model to predict later
plt.legend()
plt.show()

Next we need two helper functions to convert this data to the JSON notation DeepAR expects.

def series_to_obj(ts, cat=None):
    obj = {"start": str(ts.index[0]), "target": list(ts)}
    if cat is not None:
        obj["cat"] = cat
    return obj

def series_to_jsonline(ts, cat=None):
    return json.dumps(series_to_obj(ts, cat)) #Most annoying process is the conversion to the JSON data type – makes it easier to dump Excel data
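
To make that format concrete, here’s a quick, purely illustrative check you can run once the two helpers above are defined; the tiny three-month series is made up.

#Illustrative only: one JSON line for a tiny made-up monthly series
example = pd.Series([10.0, 12.0, 9.0], index=pd.date_range('2019-01-31', periods=3, freq='M'))
print(series_to_jsonline(example))
#Prints something like: {"start": "2019-01-31 00:00:00", "target": [10.0, 12.0, 9.0]}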

We encode the JSON lines as UTF-8 because we open the S3 files in binary mode, so each line must be written as bytes.

encoding = "utf-8" #Turns the JSON strings into bytes, because we have to encode the data for writing to S3
s3filesystem = s3fs.S3FileSystem() # Data: Excel -> DataFrame -> appended series -> JSON -> UTF-8 bytes -> S3 bucket JSON. Brutal, dude

with s3filesystem.open(s3_data_path + "/train/train.json", 'wb') as fp:
    for ts in time_series_training:
        fp.write(series_to_jsonline(ts).encode(encoding))
        fp.write('\n'.encode(encoding))

with s3filesystem.open(s3_data_path + "/test/test.json", 'wb') as fp:
    for ts in time_series:
        fp.write(series_to_jsonline(ts).encode(encoding))
        fp.write('\n'.encode(encoding))

Now we set up the estimator and the training instance type. I’m not too familiar with the sizing of these instances, to be honest, so I just used the default one from the DeepAR tutorial for this dataset as well.

estimator = sagemaker.estimator.Estimator(
    sagemaker_session=sagemaker_session,
    image_name=image_name,
    role=role,
    train_instance_count=1,
    train_instance_type='ml.c4.xlarge',
    base_job_name='DEMO-deepar',
    output_path="s3://" + s3_output_path
)

Hyperparameters

This step is crucial and will change with your own data. Hyperparameters are the knobs you’ll turn to get more accurate predictions later on. There’s a whole science behind hyperparameters and their impact on your model’s training, but in short, the more epochs, or passes over your data, the better it performs in most cases. The learning rate can be lowered or raised if your data is very large and you don’t want to burn a lot of computing power every time you train your model. These values work pretty well in my experience.

hyperparameters = {
    "time_freq": freq,
    "context_length": str(context_length),
    "prediction_length": str(prediction_length),
    "num_cells": "40",
    "num_layers": "3",
    "likelihood": "gaussian",
    "epochs": "80",
    "mini_batch_size": "32",
    "learning_rate": "0.001",
    "dropout_rate": "0.05",
    "early_stopping_patience": "10"
}

estimator.set_hyperparameters(**hyperparameters)

%%time
data_channels = {
    "train": "{}/train/".format(s3_data_path),
    "test": "{}/test/".format(s3_data_path)
}

estimator.fit(inputs=data_channels, wait=True)
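
As an optional aside, if you’d rather not hand-pick the hyperparameters above, SageMaker can search over ranges for you with automatic model tuning. This wasn’t part of my workflow, so treat the sketch below as a starting point; the ranges, metric, and job counts are example values.

from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

#Example ranges, tune these to your own data and budget
hyperparameter_ranges = {
    'learning_rate': ContinuousParameter(0.0001, 0.01),
    'context_length': IntegerParameter(6, 24)
}

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name='test:RMSE', #DeepAR reports RMSE on the test channel
    objective_type='Minimize',
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=10, #Each job is a full training run, so this costs money
    max_parallel_jobs=2
)

tuner.fit(inputs=data_channels)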

This step may take a bit of time; it takes the finished training job and deploys the trained model to a live endpoint we can query for predictions.

job_name = estimator.latest_training_job.name

endpoint_name = sagemaker_session.endpoint_from_job(
    job_name=job_name,
    initial_instance_count=1,
    instance_type='ml.m4.xlarge',
    deployment_image=image_name,
    role=role
)

These are AWS’s prebuilt predictor functions. You can change them if you must, but I chose not to touch them from the tutorials.

class DeepARPredictor(sagemaker.predictor.RealTimePredictor):

    def set_prediction_parameters(self, freq, prediction_length):
        """Set the time frequency and prediction length parameters. This method **must** be called
        before being able to use `predict`.

        Parameters:
        freq -- string indicating the time frequency
        prediction_length -- integer, number of predicted time points

        Return value: none.
        """
        self.freq = freq
        self.prediction_length = prediction_length

    def predict(self, ts, cat=None, encoding="utf-8", num_samples=100, quantiles=["0.1", "0.5", "0.9"]):
        """Requests the prediction for the time series listed in `ts`, each with the (optional)
        corresponding category listed in `cat`.

        Parameters:
        ts -- list of `pandas.Series` objects, the time series to predict
        cat -- list of integers (default: None)
        encoding -- string, encoding to use for the request (default: "utf-8")
        num_samples -- integer, number of samples to compute at prediction time (default: 100)
        quantiles -- list of strings specifying the quantiles to compute (default: ["0.1", "0.5", "0.9"])

        Return value: list of `pandas.DataFrame` objects, each containing the predictions
        """
        prediction_times = [x.index[-1] + pd.tseries.frequencies.to_offset(self.freq) for x in ts] #First period after the end of each series
        req = self.__encode_request(ts, cat, encoding, num_samples, quantiles)
        res = super(DeepARPredictor, self).predict(req)
        return self.__decode_response(res, prediction_times, encoding)

    def __encode_request(self, ts, cat, encoding, num_samples, quantiles):
        instances = [series_to_obj(ts[k], cat[k] if cat else None) for k in range(len(ts))]
        configuration = {"num_samples": num_samples, "output_types": ["quantiles"], "quantiles": quantiles}
        http_request_data = {"instances": instances, "configuration": configuration}
        return json.dumps(http_request_data).encode(encoding)

    def __decode_response(self, response, prediction_times, encoding):
        response_data = json.loads(response.decode(encoding))
        list_of_df = []
        for k in range(len(prediction_times)):
            prediction_index = pd.date_range(start=prediction_times[k], freq=self.freq, periods=self.prediction_length)
            list_of_df.append(pd.DataFrame(data=response_data['predictions'][k]['quantiles'], index=prediction_index))
        return list_of_df

Setting up the predictor to use the class defined above:

predictor = DeepARPredictor(
    endpoint=endpoint_name,
    sagemaker_session=sagemaker_session,
    content_type="application/json"
)
predictor.set_prediction_parameters(freq, prediction_length)

Now let’s view the predictions for the first five time series in the set and see how closely they match the actual values.

list_of_df = predictor.predict(time_series_training[:5]) #Predict the first five series so they line up with actual_data below
actual_data = time_series[:5]

for k in range(len(list_of_df)):
    plt.figure(figsize=(12, 6))
    actual_data[k][-prediction_length - context_length:].plot(label='target')
    p10 = list_of_df[k]['0.1']
    p90 = list_of_df[k]['0.9']
    plt.fill_between(p10.index, p10, p90, color='y', alpha=0.5, label='80% confidence interval')
    list_of_df[k]['0.5'].plot(label='prediction median')
    plt.legend()
    plt.show()

You can then predict any item in the time series data that you’re interested in. That’s what the line below is doing.

print(predictor.predict(time_series[65:66]))

Finally, once you have your predictions, you need to delete the endpoint to keep from paying a massive amount for AWS services. Also go ahead and stop your Jupyter notebook instance to prevent charges from that, too.

sagemaker_session.delete_endpoint(endpoint_name) #Save my bank account and run this command whenever you're done
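
Stopping the notebook instance can be done from the SageMaker console, or programmatically if you prefer. Here’s a minimal sketch with boto3; the instance name is a placeholder for whatever you named yours.

import boto3

sm = boto3.client('sagemaker')

#Placeholder name, use whatever you called your notebook instance
#Note: running this from inside the same notebook will shut the notebook down
sm.stop_notebook_instance(NotebookInstanceName='my-deepar-notebook')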

Lots of Code – But What Do I Need?

Let’s do a brief rundown that’ll have you machine learning in no time.

  1. Your data needs to have some consistent granularity, repeating at certain intervals and all beginning at the same time. Mine is monthly; yours can be weekly, hourly, etc.
  2. This data needs to be loaded into an S3 bucket with ‘sagemaker’ at the beginning of its name so SageMaker has access to it.
  3. Load this data into your Jupyter notebook through SageMaker. You can do this by editing my code above or writing your own. You’ll do this with the pandas package.
  4. This data will be read in and split into a list of time series, so we’ll have to convert it to the JSON format that AWS’s functions expect.
  5. Once we have this data read in, the rest is easy: we’ll use AWS’s tools to train a model using our chosen hyperparameters.
  6. Once the model is trained, it’ll be stored in the same S3 bucket our data was stored in.
  7. We can then activate an ‘endpoint’, which is a live tool we can use for predictions. This is the most cost-intensive part of the process, so once your endpoint is active, you need to know which time series you want to predict.
  8. We can run AWS’s predict function on the time series we want using the model we trained.
  9. Once we have the predicted values, we need to terminate the expensive endpoint and stop the notebook instance.
  10. In the future, you may want to invest time in getting these results written into an Excel sheet rather than just printed in the console. This is something you could do with XlsxWriter and pandas – see the sketch after this list.
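
Here’s a minimal sketch of that last idea, assuming list_of_df (the prediction DataFrames) and data (the original spreadsheet) are still in memory from the cells above; the output file name is a placeholder.

import pandas as pd

#Requires the xlsxwriter package (already available in most SageMaker environments, otherwise pip install xlsxwriter)
#Write each material's prediction quantiles to its own sheet in one workbook
with pd.ExcelWriter('deepar_predictions.xlsx', engine='xlsxwriter') as writer:
    for k, df in enumerate(list_of_df):
        sheet_name = str(data.columns[k])[:31] #Excel sheet names are capped at 31 characters
        df.to_excel(writer, sheet_name=sheet_name)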

 

Thanks for making it to the end of this very intensive post! I know it isn’t for everyone, but this is all you need to get up and running with machine learning on whatever Excel data you have. If you have any questions, feel free to write them below and I’ll try to answer your concerns.

Enjoy the content? Consider subscribing. 
