AGRON INFO TECH

Time Series Forecasting for Nile River’s Annual Streamflow Data

In this blog post, we will do exploration of the Nile River’s annual streamflow data using time series analysis. The data, which spans from 1871 to 1970, offers a rich context for understanding the river’s flow patterns over time. We will employ ARIMA models, a popular choice for time series forecasting, to analyze this data. The process involves identifying the optimal parameters for our model, fitting the model to the data, and then generating forecasts. By comparing different ARIMA models, we aim to select the one that best fits our data and provides the most accurate forecasts.

Let’s start!

Introduction

We will use the famous Nile River’s annual streamflow data, conveniently stored in a time series data set named Nile. This dataset goes beyond just numbers; it also includes metadata, providing valuable context. When you call print(Nile), you’ll see details like Start = 1871, indicating the first year of data collection, and End = 1970, marking the final year covered in this specific dataset.

data("Nile")
print(Nile)
## Time Series:
## Start = 1871 
## End = 1970 
## Frequency = 1 
##   [1] 1120 1160  963 1210 1160 1160  813 1230 1370 1140  995  935 1110  994 1020
##  [16]  960 1180  799  958 1140 1100 1210 1150 1250 1260 1220 1030 1100  774  840
##  [31]  874  694  940  833  701  916  692 1020 1050  969  831  726  456  824  702
##  [46] 1120 1100  832  764  821  768  845  864  862  698  845  744  796 1040  759
##  [61]  781  865  845  944  984  897  822 1010  771  676  649  846  812  742  801
##  [76] 1040  860  874  848  890  744  749  838 1050  918  986  797  923  975  815
##  [91] 1020  906  901 1170  912  746  919  718  714  740

To make sure the Nile is a time series data you can use is.ts() and class() function to verify this.

is.ts(Nile)
## [1] TRUE
class(Nile)
## [1] "ts"

The plot() function visualizes the historical flow patterns of data. We can observe variations in flow throughout the years, potentially indicating trends or long-term changes.

plot(Nile)

Model Identification

We shall use the two most commonly used functions (acf and pacf) to analyze the time series properties of this data.

par(mfrow = c(1,2))
acf(Nile)
pacf(Nile)

Based on the partial autocorrelation (ACF) plot in the image, it appears that an ARIMA order of (1, 1, 0) might be appropriate. This is because the ACF plot shows a single spike at lag 1, and then dies off afterwards. This suggests that there is an autoregressive term of order 1 (AR(1)) in the model, but no higher order terms are needed. Further we can also see that the autocorrelation values also do not come to zero reasonably quickly, indicates no-stationarity and the series need to be differenced. The autocorrelation (ACF) plot also shows some spikes that cut off after a few lags, but they are not as clear-cut as the PACF plot. This further suggests that a moving average term (MA) is not required in the model (MA(0)).

There is no evidence in the plots to suggest that a seasonal component should be included in the model. The PACF and ACF plots do not exhibit any periodic patterns, which would be indicative of a seasonal component.

Based on these facts we will select ARIMA model of order (1, 1, 0).

Let’s use the auto.arima() function to identify the optimal order for ARIMA model for this time series data. This function will automatically search for the best combination of parameters (p, d, q) based on statistical criteria.

library(forecast)
auto.arima(Nile)
## Series: Nile 
## ARIMA(1,1,1) 
## 
## Coefficients:
##          ar1      ma1
##       0.2544  -0.8741
## s.e.  0.1194   0.0605
## 
## sigma^2 = 20177:  log likelihood = -630.63
## AIC=1267.25   AICc=1267.51   BIC=1275.04

Fitting ARIMA models

While auto.ARIMA suggests ARIMA(1,1,1), it’s wise to compare it to ARIMA(1,1,0) (from the ACF plot) and ARIMA(1,0,0) (no differencing). This will ensure to select the best possible model by comparing their characteristics.

# Fit an AR model to Nile
Auto_fit <- auto.arima(Nile)
AR_fit <- arima(Nile, order = c(1, 0, 0))
ARd_fit <- arima(Nile, order = c(1, 1, 0))

Comparing ARIMA models

We can use statistical tests like AIC (Akaike Information Criterion) or BIC (Bayesian Information Criterion) to see which model achieves a better fit for this data.

# Define the model fits
fits <- list(Auto_fit, AR_fit, ARd_fit)
fit_names <- c("Auto_fit", "AR_fit", "ARd_fit")

# Initialize an empty data frame
results_df <- data.frame(AIC = numeric(length(fits)), 
                         BIC = numeric(length(fits)), 
                         row.names = fit_names)

# Calculate AIC and BIC for each model and store in the data frame
for (i in seq_along(fits)) {
  results_df[i, "AIC"] <- AIC(fits[[i]])
  results_df[i, "BIC"] <- BIC(fits[[i]])
}

# Print the results data frame
print(results_df)
##               AIC      BIC
## Auto_fit 1267.255 1275.040
## AR_fit   1285.904 1293.720
## ARd_fit  1281.480 1286.671

Both AIC (1267.255) and BIC (1275.04) for Auto_fit are lower than the corresponding values for AR_fit and ARd_fit. Generally, lower AIC and BIC indicate a better fit for the data. While the difference in AIC between Auto_fit and ARd_fit is relatively small (around 14), it still suggests a slight preference for Auto_fit. BIC also favors Auto_fit with a clearer margin.

Generating Forecasts

Given that the ARIMA(1,1,1) model demonstrated a better fit compared to the other two models, we can proceed with generating 10-step forecasts using this model. It’s important to remember that these forecasts are based on the assumption that the underlying patterns in the data will continue to hold for the forecasted period.

While the ARIMA(1,1,1) model will be our primary forecast due to its superior fit, it can also be informative to generate forecasts for the other two models. This comparison can help us visualize the potential range of future values and assess the confidence level in our primary forecast.

# Use predict to make 1-step through 10-step forecasts
Auto_pred <- predict(Auto_fit, n.ahead = 10)
AR_pred <- predict(AR_fit, n.ahead = 10)
ARd_pred <- predict(ARd_fit, n.ahead = 10)

Comparing ARIMA Model Forecasts

This code block generates three visualizations comparing the forecasted Nile River streamflow using different ARIMA models. The par(mfrow = c(3,1)) line sets up a grid layout with three rows and one column to display the plots efficiently.

The ts.plot() line displays the actual Nile River streamflow data (Nile) along with a time axis limited to the years 1871 to 1980 (xlim). The points() function plots the predicted streamflow values (stored in Auto_pred$pred) as a red line.

par(mfrow = c(3,1))
# Predictions of Auto model ARIMA(1, 1, 1)
ts.plot(Nile, xlim = c(1871, 1980), 
        ylab = "Nile River Flow",
        main = "Predictions of Auto_fit model ARIMA(1, 1, 1)")
points(Auto_pred$pred, type = "l", col = "red")
points(Auto_pred$pred - Auto_pred$se, type = "l", col = "red", lty = 2)
points(Auto_pred$pred + Auto_pred$se, type = "l", col = "red", lty = 2)

# Predictions of AR model ARIMA(1, 0, 0)
ts.plot(Nile, xlim = c(1871, 1980), 
        ylab = "Nile River Flow",
        main = "Predictions of AR_fit model ARIMA(1, 0, 0)")
points(AR_pred$pred, type = "l", col = "red")
points(AR_pred$pred - AR_pred$se, type = "l", col = "red", lty = 2)
points(AR_pred$pred + AR_pred$se, type = "l", col = "red", lty = 2)

# Predictions of ARd model ARIMA(1, 1, 0)
ts.plot(Nile, xlim = c(1871, 1980), ylab = "Nile River Flow", 
        main = "Predictions of ARd_fit model ARIMA(1, 1, 0)")
points(ARd_pred$pred, type = "l", col = "red")
points(ARd_pred$pred - ARd_pred$se, type = "l", col = "red", lty = 2)
points(ARd_pred$pred + ARd_pred$se, type = "l", col = "red", lty = 2)
Time Series Forecasting

Here are some observations based on the above predictions:

  • Auto_fit model provided relatively small prediction ranges and standard errors compared to the other two models, indicating potentially more precise predictions.
  • AR_fit model has larger prediction ranges but still reasonable standard errors, suggesting moderate precision.
  • ARd_fit model has the smallest prediction range but also the largest standard errors, indicating potentially less reliable predictions due to higher uncertainty.

We can enhance the visualization of predictions from various models using the ggplot function, enabling more effective comparisons.

# Initialize an empty list to store the data frames
pred_dfs <- list()

# Loop through the models and make forecasts
models <- list(Auto_fit, AR_fit, ARd_fit)
model_names <- c("Auto_pred", "AR_pred", "ARd_pred")
for (i in seq_along(models)) {
  pred <- predict(models[[i]], n.ahead = 10)
  pred_df <- data.frame(
    x = c(time(pred$pred)),
    pred = c(pred$pred),
    se = c(pred$se)
  )
  pred_dfs[[i]] <- cbind(pred_df, model = model_names[i])
}

# Combine the data frames into one
combined_df <- do.call(rbind, pred_dfs)


# Load required libraries
library(ggplot2)

# Plotting
ggplot(
  combined_df, aes(x = x, y = pred, color = model)
) +
  geom_line(linewidth = 1) +
  geom_ribbon(
    aes(ymin = pred - se, ymax = pred + se, fill = model), 
    alpha = 0.3, lty = 2
  ) +
  labs(x = "Year", 
       y = expression("Predicted flow of River Nile (m"^3*")"), 
       title = "Prediction with Standard Error") +
  theme_minimal() +
  scale_x_continuous(breaks = seq(1971, 1980, by = 2))  +
  facet_wrap(facets = ~model) +
  theme(legend.position = "none")

Conclusion

This analysis of the Nile River’s annual streamflow data demonstrated the utility of ARIMA models for time series forecasting. The ARIMA(1,1,1) model was identified as the best fit for the data, outperforming the ARIMA(1,0,0) and ARIMA(1,1,0) models according to AIC and BIC criteria. The forecasts generated by the ARIMA(1,1,1) model were more precise, as indicated by smaller prediction ranges and standard errors. However, it’s important to remember that these forecasts are based on the assumption that the underlying patterns in the data will continue to hold for the forecasted period. This analysis underscores the importance of model selection in time series forecasting and the value of comparing different models to ensure the best possible predictions.

For more details and informative videos you can also subscribe to our YouTube Channel AGRON Info Tech.


Download R program and R studio — Click_here