A Primer in Time Series Analysis

Ed warothe
5 min read · Feb 2, 2021

How to analyze and predict your date/time data with a few lines of R code

Time series analysis is underrepresented in beginner machine learning articles. Even with more automation and powerful algorithms leading the way in time series forecasting, most approaches share the same core principles: ARIMA (autoregressive integrated moving average) models and the decomposition of a time series into trend, seasonal, and cyclical components. My objective today is to introduce these basics in a way that is easy to understand and apply.
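To make the decomposition idea concrete before we touch the e-commerce data, here is a minimal sketch on R's built-in AirPassengers series (my choice of example, not part of the analysis below); stl() splits a series into trend, seasonal, and remainder components.

# Sketch: decompose a monthly series into trend, seasonal, and remainder.
decomp <- stl(AirPassengers, s.window = "periodic")
plot(decomp)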

I’ll be using this e-commerce data from Kaggle to extract trends, cycles, and seasonal changes in the data, then predict sales for a given period.

First, we load the required packages and data (which is in CSV format).

library(tidyverse) 
library(forecast)
library(tseries)
library(lubridate)
ecom <- read_csv('C:/data.csv', col_types = cols(InvoiceDate = col_datetime(format = "%m/%d/%Y %H:%M"), Country = col_factor()))
str(ecom)

Does the data have any missing values?

sapply(ecom, function(x) sum(is.na(x)))

Approximately 24% of CustomerID and 2% of Description values are missing. Luckily, all InvoiceDate, Quantity, and UnitPrice values are present.
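If percentages are more convenient than raw counts, a small variation of the same idea works; this is just a sketch built on the ecom object loaded above.

# Sketch: share of missing values per column, as a percentage.
sapply(ecom, function(x) round(100 * mean(is.na(x)), 1))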

Since our analysis is focused on predicting revenues/total sales, we can derive revenue per row by multiplying the Quantity and UnitPrice columns. Further, a quick analysis reveals that just over 90% of transactions come from the UK. Therefore, let's narrow down and predict sales for the UK.

ecom$totalRevenue <- with(ecom, Quantity * UnitPrice)
View(ecom)
count(ecom, Country) %>% mutate(freq = (n * 100) / sum(n))

For the time series analysis, we only need the InvoiceDate and totalRevenue columns. We then draw a box plot to visualize the outlier distribution of the totalRevenue column.

ecom_uk <- ecom %>% 
filter(Country == "United Kingdom") %>%
select(InvoiceDate, totalRevenue)
boxplot(ecom_uk$totalRevenue)
Outliers in the totalRevenue column

It seems we have negative values for totalRevenue. These come from item returns and bad debt adjustments in the UnitPrice column. Since customers are refunded once they return items, we’ll drop the negative values and analyze the rest for now.

ecom %>% filter(Quantity < 0) %>% select(Quantity) %>% head()

ecom_uk <- ecom_uk %>% filter(totalRevenue > 0)
summary(ecom_uk)

Next, we get the daily average revenue and convert it to a time series object.

ecom_uk <- na.omit(ecom_uk)
ecom_ts <- ecom_uk %>%
  mutate(invdate = floor_date(InvoiceDate, "day")) %>%
  group_by(invdate) %>%
  summarise(avg_rev = mean(totalRevenue, na.rm = TRUE))
ecom_ts$avg_rev <- ts(ecom_ts$avg_rev, start = c(2010, 12), end = c(2011, 12), frequency = 304)
str(ecom_ts$avg_rev)

Now that we finally have a time series object, analysis and visualization with time series methods become much easier. Let’s visualize the average daily revenue using the ggtsdisplay function from the forecast package. This function plots the time series together with its autocorrelation (ACF) and partial autocorrelation (PACF) plots. The ACF plot reveals correlation with lagged values, i.e. whether subsequent values follow a pattern that affects our prediction model. The PACF shows the correlation at each lag after the effect of shorter lags has been removed, so it highlights direct rather than indirect correlation in the data. The reader should be warned, however, that for rigor, further statistical tests are required, and certain assumptions (for example, constant variance of the error term and the distribution of outliers) need to be checked and/or controlled for.

ggtsdisplay(ecom_ts$avg_rev)
Time series, autocorrelation, and partial autocorrelation plots for the totalRevenue column

It appears there is a pattern around lag 6 and lag 10. This could come from buying patterns on certain days, for instance Saturdays.
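One quick, informal way to probe that hunch is to compare average revenue across weekdays; the grouping below is my own addition on top of the ecom_ts table built earlier, not part of the original walkthrough.

# Sketch: average daily revenue by weekday, to see whether particular
# days (e.g. Saturdays) drive the pattern seen in the ACF/PACF plots.
ecom_ts %>%
  mutate(weekday = wday(invdate, label = TRUE)) %>%
  group_by(weekday) %>%
  summarise(mean_rev = mean(avg_rev, na.rm = TRUE))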

The next step is prediction using autoregressive integrated moving average (ARIMA) models. To do this, we first have to find out whether the data is stationary, that is, free of seasonality, trends, and cyclical variation; in simple terms, its statistical properties do not depend on time. The augmented Dickey-Fuller test, available as adf.test in the tseries library, is a statistical test suited for this task.

adf.test(ecom_ts$avg_rev, alternative = 'stationary')

output:
data: ecom_ts$avg_rev
Dickey-Fuller = -4.2958, Lag order = 6, p-value = 0.01
alternative hypothesis: stationary

The p-value (0.01) is below the usual 0.05 threshold, so we reject the null hypothesis of a unit root and conclude that the data is stationary. Moreover, the ndiffs function supplements this information by telling us how many times we should difference the series to ensure stationarity before fitting an ARIMA model.

ndiffs(ecom_ts$avg_rev, alpha = .05, test = c('pp'), type = 'trend', max.d = 2) # using the Phillips-Perron (PP) test, a variant of the ADF test
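If ndiffs had suggested one or more differences, a reasonable next step (sketched here as an aside, not part of the original workflow) would be to difference the series with diff() and re-run the stationarity test.

# Sketch: first-difference the series and re-check stationarity.
diffed <- diff(ecom_ts$avg_rev, differences = 1)
adf.test(diffed, alternative = 'stationary')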

We then log-transform the data to lessen the effect of outliers and fit an ARIMA model with auto.arima.

fit <- auto.arima(log(ecom_ts$avg_rev), approximation = FALSE, stepwise = FALSE) 
summary(fit)

Now we check whether the residuals of our fitted model resemble white noise, that is, they do not have a pattern.

ggAcf(residuals(fit))
ACF plot for the model residuals

There appears to be no discernible pattern in the above series so we can proceed with prediction.
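For a more formal check than eyeballing the ACF, the forecast package’s checkresiduals function also runs a Ljung-Box test on the residuals; this extra step is my addition to the workflow.

# Sketch: residual diagnostics plus a Ljung-Box test for remaining
# autocorrelation (a large p-value is consistent with white noise).
checkresiduals(fit)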

forecast(fit, h=30) %>% autoplot() + ggtitle('30 Day UK Sales Forecast') + ylab('Average Revenue')

The blue bands are the 80% and 95% prediction intervals (the forecast package defaults). They widen as the forecast horizon increases.
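Because the model was fitted on log(avg_rev), the forecasts above are on the log scale. One way to read them in the original revenue units is to exponentiate the point forecasts and interval bounds; the fc object name below is mine, introduced for this sketch.

# Sketch: back-transform the log-scale forecasts to revenue units.
fc <- forecast(fit, h = 30)
data.frame(
  point = exp(fc$mean),
  lo80  = exp(fc$lower[, 1]),
  hi80  = exp(fc$upper[, 1])
)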

Since this article is an introduction to time series analysis, further questions to be answered include:

  1. Can we get a more detailed picture of spending habits if we disaggregate the data by revenue/sales? For example, we could split the column at the median and perform a separate analysis for each half.
  2. How do other methods (Facebook Prophet, deep learning methods such as LSTMs and RNNs) stack up against the one we’ve used today, taking interpretability and complexity into account?

For more stories, have a look at my website.

Originally published at http://github.com.


Ed warothe

Ed helps organizations make sense of their data through analytics and visualization with multiple programming languages. Dabbles in the great outdoors also.