Here we will learn the details of data preparation for LSTM models, and build an LSTM Autoencoder for rare-event classification. This post is a continuation of my previous post, Extreme Rare Event Classification using Autoencoders. In the previous post, we talked about the challenges of extremely rare event data, in which less than 1% of the data is positively labeled.
Dataset: Rare Event Classification in Multivariate Time Series (Pt. 1)
Case Study
A real-world dataset is provided from the pulp-and-paper manufacturing industry. The dataset comes from a multivariate time series process. It records paper breaks, an event that is well known in the industry but rare in the data. The data contains sensor readings at regular time intervals (x’s) and the event label (y).
The primary purpose of the data is thought to be building a classification model for early prediction of a rare event. However, it can also be used for multivariate time series data exploration and building other supervised and unsupervised models.
Problem
A multivariate time series (MTS) is produced when multiple interconnected streams of data are recorded over time. They are commonly found in manufacturing processes that have several interconnected sensors collecting data over time. In this problem, we have similar multivariate time series data from the pulp-and-paper industry with a rare event associated with it. The event (a paper break, in our case) is unwanted in the process and should be prevented.
The objective of the problem is to:
- Predict the event before it occurs, and
- Identify the variables that are expected to cause the event (in order to be able to prevent it).
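One common way to set up the first objective is to move the positive label to the rows just before each event, so that a classifier learns what the process looks like shortly ahead of a break. The sketch below is only an illustration under assumptions that are not part of the dataset description: the function name `shift_labels_earlier`, the choice of two rows of lead time, and the decision to drop the break row itself are all hypothetical.

```python
def shift_labels_earlier(labels, steps):
    """Relabel the `steps` rows before each event as positive.

    Returns (new_labels, keep). keep[i] is False for the original event
    rows, which are assumed to be too late to be useful for early
    prediction and would be dropped before training.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    n = len(labels)
    new_labels = [0] * n
    keep = [True] * n
    for i, y in enumerate(labels):
        if y == 1:
            keep[i] = False  # the break row itself is excluded
            for j in range(max(0, i - steps), i):
                new_labels[j] = 1  # rows leading up to the break
    return new_labels, keep


# Toy label sequence with a single break at index 5 (made-up data).
y = [0, 0, 0, 0, 0, 1, 0, 0]
shifted, keep = shift_labels_earlier(y, steps=2)
print(shifted)  # [0, 0, 0, 1, 1, 0, 0, 0]
print(keep)
```

How many rows to shift depends on the sampling interval and on how much warning time is useful in practice, so `steps` should be treated as a tuning choice rather than a fixed value.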
Data
We provide data from a pulp-and-paper mill. An example of a paper manufacturing machine is shown above. These machines are typically several meters long; they ingest raw materials at one end and produce reels of paper, as shown in the picture.
Several sensors are placed in different parts of the machine along its length and breadth. These sensors measure both raw materials (e.g. amount of pulp fiber, chemicals, etc.) and process variables (e.g. blade type, couch vacuum, rotor speed, etc.).
Paper manufacturing can be viewed as a continuous rolling process. During this process, the paper sometimes breaks. If a break happens, the entire process is stopped, the reel is taken out, any problem found is fixed, and production is resumed. The resumption can take more than an hour. The cost of this lost production time is significant for a mill. Even a 5% reduction in break events would give a mill a significant cost saving.
The objective of the given problem is to predict such breaks in advance (early prediction) and identify the potential cause(s) to prevent the break. To build such a prediction model, we will use the data collected from the network of sensors in a mill. This is multivariate time series data with the break as the response (a binary variable).
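To make the layout concrete, the following sketch builds a tiny made-up table in the shape described above: one row per reading, a timestamp, a few sensor columns, and a binary response. The column names (`time`, `x1`, `x2`, `y`), the two-minute interval, and the values are all assumptions for illustration and are not taken from the actual dataset.

```python
from datetime import datetime, timedelta

# Hypothetical rows: regular readings with two sensors and a binary label.
start = datetime(2020, 1, 1)
rows = [
    {"time": start + timedelta(minutes=2 * i),
     "x1": 0.1 * i,
     "x2": 1.0 - 0.05 * i,
     "y": 0}
    for i in range(10)
]
rows[7]["y"] = 1  # a single made-up break event


def event_rate(rows, label="y"):
    """Fraction of rows that carry the positive label."""
    if not rows:
        raise ValueError("rows must not be empty")
    return sum(r[label] for r in rows) / len(rows)


def split_features_and_response(rows, label="y", time="time"):
    """Separate the sensor columns (X) from the binary response (y)."""
    feature_names = [k for k in rows[0] if k not in (label, time)]
    X = [[r[k] for k in feature_names] for r in rows]
    y = [r[label] for r in rows]
    return feature_names, X, y


feature_names, X, y = split_features_and_response(rows)
print(feature_names, len(X), event_rate(rows))
```

Checking the event rate first is a quick way to see how imbalanced the response is before choosing a modeling approach.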
Related Articles
LSTM Autoencoder for Extreme Rare Event Classification in Keras
Extreme Rare Event Classification using Autoencoders in Keras
In a rare-event problem, we have an unbalanced dataset; that is, we have fewer positively labeled samples than negative ones. In a typical rare-event problem, the positively labeled data are around 5–10% of the total. In an extremely rare event problem, we have less than 1% positively labeled data.
Estimating Non-Linear Correlation in R
Correlation estimations are commonly used in various data mining applications. In my experience, nonlinear correlations are quite common in various processes. Due to this, nonlinear models, such as SVM, are employed for regression, classification, etc.