Recently I constructed a simulation that modeled the consumable supply system of Marine Corps Logistics. The Marine Corps has several operational-level Supply Management Units (SMU) that warehouse and handle consumable parts orders for the units in their command structure. A consumable part is defined as a part that can’t be repaired, such as a tire, or a brake pad or so forth. These supply units process the hundreds of thousands of consumable part orders that keep the Marine Corps running every year. One of the major challenges in running a supply operation is forecasting your future orders, particularly for a service such as the Marine Corps where we might only need one order per year for a part on systems that don’t have a large fleet size. The old simulation used an average based on quarterly demands. Computing a more robust forecast posed some unique challenges because of the size of the time series data set for the SMUs.

After the simulation project ended, I decided to tackle the forecasting problem, and see if forecasts could be generated in a timely manner. For this particular problem, the consumable part demand records for seven main SMUs were distilled into 172,000 separate two-year long time series analysis sets. The desired end state was to find the “best” time series forecast for each set.

I decided to use the R programming language because of its robust set of time series libraries. Forty-two different analytic methods on each set were attempted. That’s 7.2 million forecasts. I use the word “attempt” here because some of the runs bomb out, due to discontinuities in the time series, too much variability, and so forth. The analytic methods consisted of:

Croston’s forecasting method: Croston’s method is specialized for intermittent/low quantity demands, which are rife in our data set as explained above.

Holt Winters: Useful for seasonality and weighted moving averages. We can’t really do a lot with the seasonality functions, since we only have two years of data.

ARIMA: An Autoregressive Integrated Moving Average Model, with three parameters (p=period,d=differencing degree,q=moving average order). We don’t know what values to pick for these parameters, so we need to look at forty different variations of p, d, and q for each time series, which will eat up the bulk of the processing time. A detailed explanation of periods, differencing degrees, and moving averages is beyond the scope of this article.

What is the * “best”* forecast? From a logistics standpoint, the best forecast is the one that costs the least in warehousing overhead, and fulfills orders at the highest level, commonly called “fill rate” in supply chain jargon. Absent a crystal ball, we’ll have to use statistical methods to determine what

*is. Each of the analytic methods that we’ll be using in R has their own measure of statistical error(s) based on the results and the original time series set. For this model, I decided to go with the measure called Mean Squared Rate (MSR), which is the squared difference between the estimate and the cumulative mean. Nikolaos Kourentzes describes the logic for using MSR on intermittent time series here.*

**best**Analysis and forecasting of the time series sets on a single computer with a four-core CPU (hyperthreading and virtual core use actually improves this to eight cores) computer operating at 2.3 GHz and R version 3.2 took five days. While this included database queries, matrix manipulation, and database inserts, the time involved with these operations was trivial compared to the time series analysis algorithms. Even with an eight-core CPU, if you don’t explicitly **tell** R to use more than one core, you end up getting one-core speed.

The first step in speeding up the analysis involved utilization of the R doParallel library, which allows you to utilize more than one core of the host machine’s CPU. The library requires you to tell it how many cores you want it to utilize, which requires you to query the host machine and ask for the number of cores in the CPU. This can be done with the doParallel command *detectCores()*:

**cores<-detectCores()**

Attempts to use all the cores in the computer resulted in failure, as threads were discontinued by the host operating system when it needed to do housekeeping functions such as disk writes. The remedy for this is to leave one of the cores unused so the operating system has a dedicated core on standby to accomplish its background services. There’s a little bit of discussion on how to juggle your cores on stackoverflow.com, but using all cores always threw an exception when a thread was nulled out and taken over by the operating system.

Another strange thing about using the doParallel library in R is it’s relationship with the R workspace. The libraries, variables, and functions that you load into the default workspace aren’t available to the whatever function you’re sending to doParallel. You have to load the whatever functions and libraries you’re going to need into a “cluster” object. For libraries it looks like this:

**cl <- makeCluster(cores)**

** clusterEvalQ(cl, {library(forecast)})**

** clusterEvalQ(cl, {library(RMySQL)})**

For any functions that are called internally by the function you’re going to run in a parallel computing configuration, you have to load the entire sub-function into a call which is a bit more painful:

**clusterEvalQ(cl, {insert.forecasts <- function(smu_forecasts, con) {**

** # about fifty lines of code…**

** }**

** })**

Then you make the call to * doParallel()* that will start the parallel computation:

**parLapply(cl, transfer_list, forecast.acc,smu)**

This is telling the library, that I want to use my newly created cluster cl. I’m going to use the list transfer_list as the data object. **forecast.acc** is the function that I wrote to create the forecasts. And “smu” is a parameter for **forecast.acc**, a string variable in this case. * doParallel() *is going to apply each element of transfer_list to the

**forecast.acc**in parallel, using the number of cores that you told it to use. After running the parallel processing, you can stop it with:

**stopCluster(cl)**

It was necessary to chunk my dataset and stop then restart the cluster after every chunk that was sent to **doParallel()**, as either it or the R environment was hanging onto memory after the parallel processing was done. Eating up even more performance, was the fact that I had to call * gc()* (garbage collect) after

**stopCluster()**to free up this memory.

Still, parallelization on one machine lowered my computation time from five days to nine hours, but that was still excessive. I have four computers in my network that can be used for cluster computing. Although they were all different machines, they all had Windows operating systems, so that made the task somewhat simpler (although the one machine with Windows Server 2008 R2 was a little harder to set up). While R has cluster computing libraries, I found them a little onerous, and in some cases, network lag can actually make using the cluster slower than just using one machine.

The end result was a hybrid, multi-computing platform, using OpenSSH software to communicate from the master machine to its slaves and start the R scripts on each machine, where they worked on solving the problem with their own dedicated R workspace. Each machine ran parallel processes on its own multi-core CPU, tailored to its configuration. This way the network lag was reduced to almost zero.

Another task was determining the most efficient way to split up the jobs among the four slaved machines. The master machine was also the fastest, so it could handle the most work. After some test runs of the four machines in parallel processing configurations, I determined that the best batching configuration was along the lines of:

Where *μ* is the average processing time for an analysis batch, and *t _{i}* was the amount of time one machine spends on its jobs. In other words, the load is balanced so the difference between the fastest batch processing time and the slowest batch processing time is minimized. Obviously, the average processing time is going to change as different batch configurations are tried on each machine.

This can be set up as a traveling salesman-type heuristic, where each of the computers is like a traveling salesman, where they travel a path to each goal. The path’s distance is the processing time that the computer will take for the batch. The desired outcome is the shortest combined path for all salesmen. Using a variation of Dijkstra’s algorithm to find the shortest path, a heuristic was found to come up with the shortest amount of time needed to build the forecasts.

A maximum job time of 195 minutes and a minimum job time of 112 minutes ended up being the solution. After starting all four machines up, and executing, the time to completion ended up being three hours and fifteen minutes, the longest batch processing time. Compared to the original run time of over eighty hours, the processing time’s been improved well over an order of magnitude.

After running and attempting the 7.2 million forecasts, the database showed that 4.2 million valid forecasting sets for the SMUs had been generated. A simple SQL call picks out the forecast with the minimum Mean Squared Rate for each SMU part, and the job is done. Or almost done, there’s the matter of more than one of the 42 algorithms picking a solution that has the same MSR as another, but in almost all cases that involves two identical forecasts, so picking one of them was trivial. And there’s an involved validation simulation to compare the forecasts to the actual demands, but that’s another long story…