AWS Sagemaker – predicting gasoline monthly output

AWS continues to wow me with all of the services that they are coming out with. What Amazon is doing is a very smart strategy. They are leveraging their technology stack to build more advanced solutions. In doing so, Amazon Web Services is following the “Profit From The Core” strategy down to the t.  Aside from following Amazon’s world domination plan, I wanted to see how well their roll out of artificial intelligence tools, like Sagemaker, went.


There are many articles about how AI works.  In some cases, an application is extraordinarily simple.  In other cases, it is endlessly complex. We are going to stick with the most simple model.  In this model, we have to do the following steps.

  1. Collect data
  2. Clean Data
  3. Build Model
  4. Train Model
  5. Predict Something

Amazon has tried to automate these steps as best as possible.   From Amazon’s site: “Amazon SageMaker is a fully-managed platform that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale. Amazon SageMaker removes all the barriers that typically slow down developers who want to use machine learning.”

Lets see how well they do.  Gentle people…lets start our clocks.  The time is 20 May 2018 @ 6:05pm.

Notebook Instances

The first thing that you do as part of your training is build notebooks. According to Jupyter, the developer of Project Jupyter, a notebook is an application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.

You follow the simple tutorial and it looks something like this.

AWS Sage simple Jupyter Notebook

Time: 6:11:34 (so far so good)

Example Selection – Time Series Forecast

The first thing that we want to do is go to the “SageMaker Examples” tab, and make a copy of “linear_time_series_forecast_2019-05-20”.  I have had some experience predicting when events would happen and wanted to follow something that I already know. If you aren’t familiar, please check out this coursera video.

Time: 6:20:17

Read Background

Forecasting is potentially the most broadly relevant machine learning topic there is. Whether predicting future sales in retail, housing prices in real estate, traffic in cities, or patient visits in healthcare, almost every industry could benefit from improvements in their forecasts. There are numerous statistical methodologies that have been developed to forecast time-series data. However, the process for developing forecasts tends to be a mix of objective statistics and subjective interpretations.

Properly modeling time-series data takes a great deal of care. What’s the right level of aggregation to model at? Too granular and the signal gets lost in the noise, too aggregate and important variation is missed. Also, what is the right cyclicality? Daily, weekly, monthly? Are there holiday peaks? How should we weight recent versus overall trends?

Linear regression with appropriate controls for trend, seasonality, and recent behavior, remains a common method for forecasting stable time-series with reasonable volatility. This notebook will build a linear model to forecast weekly output for US gasoline products starting in 1991 to 2005. It will focus almost exclusively on the application. For a more in-depth treatment on forecasting in general, see Forecasting: Principles & Practice. In addition, because our dataset is a single time-series, we’ll stick with SageMaker’s Linear Learner algorithm. If we had multiple, related time-series, we would use SageMaker’s DeepAR algorithm, which is specifically designed for forecasting. See the DeepAR Notebook for more detail.

Time: 6:24:13

S3 Setup

Let’s start by specifying:

  • The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting.
  • The IAM role arn used to give training and hosting access to your data. See the documentation for how to create these. Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the boto regexp with a the appropriate full IAM role arn string(s).

I set up a simple s3 bucket like this: 20180520-sage-test-v1-tm

Import the Python libraries.

Got distracted and played with all of the functions.  Time 6:38:07.


Let’s download the data. More information about this dataset can be found here.

You can run some simple plots using Matlab and Pandas.

Sage time series gas plots


Transform Data To Predictive Model

Next we’ll transform the dataset to make it look a bit more like a standard prediction model.

This stage doesn’t look immediately clear. If you were to just click through the buttons, it takes a few seconds. If you want to read through these stages, it will take you a lot longer. In the end, you should have the following files stored on S3.

Note, you can’t review the content from these using a text editor. The data is stored in binary.

Time: 7:02:43

I normally don’t use a lot of notebooks. As a result, this took a little longer because I ran into some problems.


Amazon SageMaker’s Linear Learner actually fits many models in parallel. Each model has slightly different hyper-parameters. The model the best fit is the one used. This functionality is automatically enabled. We can influence this using parameters like:

  • num_models to increase the total number of models run. The specified parameters values, will always be in those models. However, the algorithm also chooses models with nearby parameter values. This is in case a nearby solution is more optimal. In this case, we’re going to use the max of 32.
  • loss which controls how we penalize mistakes in our model estimates. For this case, let’s use absolute loss. We haven’t spent much time cleaning the data. Therefore, absolute loss will adjust less to accommodate outliers.
  • wd or l1 which control regularization. Regularization helps prevent model overfitting. It works by preventing our estimates from becoming too finely tuned to the training data. This is why it is good to make sure your training data is an appropriate sample of the entire data set. In this case, we’ll leave these parameters as their default “auto”.

This part of the demo took a lot longer….

And it worked!

Ended at time: 7:21:54 pm.


The Forecast!

This is what we have all been waiting for!

For our example we’ll keep things simple and use Median Absolute Percent Error (MdAPE), but we’ll also compare it to a naive benchmark forecast (that week last year’s demand * that week last year / that week two year’s ago).

As we can see our MdAPE is substantially better than the naive. Additionally, we actually swing from a forecast that is too volatile to one that under-represents the noise in our data. However, the overall shape of the statistical forecast does appear to better represent the actual data.

Next, let’s generate a multi-step-ahead forecast. To do this, we’ll need to loop over invoking the endpoint one row at a time and make sure the lags in our model are updated appropriately.



It does appear that for pre-built scenarios that AWS’s Sagemaker worked for linear time series prediction!  While it doesn’t make you a master data scientist, it does however give you a simple place to train and practice with data sets.  If you wanted to master time series, you could simply plug in other datasets and conduct the same sort of analysis and cross check your work with other people’s results.  With Sagemaker, you have a complete and working blueprint!

Wrap up time: 8:19:30pm (with some distractions and breaks)