Dataset: Rare Event Classification in Multivariate Time Series (Pt. 2)

Dataset: Rare Event Classification in Multivariate Time Series (Pt. 2)

Written by

Case Study

Researchers are often looking for interesting real-world problems. One major roadblock they face is real-world data. Here we are trying to serve the research community by providing a real-world problem and a dataset.

This dataset is created from the pulp-and-paper manufacturing industry. Paper manufacturing is a continuous process. A paper machine runs round-the-clock to continuously roll out reels of paper.

However, as smooth as we would like this process to be, we face paper breaks almost every day. Paper sheets are not the strongest material. Due to some adverse changes in process conditions, the paper sheet sometimes breaks (tears).

Whenever a break occurs, the whole machine is stopped. It takes typically more than an hour for the machine to restore. During this downtime, the mill loses more than $10k. Worse than this, an operator is often required to enter hazardous areas to inspect and restore the machine.

In the paper mills, we worked with, on average at least one break occurs every day. This is a major problem for these mills causing yearly losses in the order of millions of dollars and work hazards. Even a 5% reduction in the breaks will bring significant benefits to the mills.

We collected this break data with the purpose of building a model that can predict a break in advance. This can help mill operators prevent them. The key is to predict in advance with small false positives. In the following, we will explain the data and underlying challenges.

The dataset comes from a multivariate time series process. As mentioned before, the data contains a rare event of paper break that commonly occurs in the industry. Although a break happens every day, we still call it a rare event because…

We have high-frequency data measured every two minutes.

For one day, we will have 720 rows. If a break happens once, although the data will have approximately one hour of consecutive rows labeled as break, we will drop all the rows except the first one labeled as a break.

For example, if a break happened at time t and was there until time t+k, we drop rows for time (t+1):(t+k). This is part of data cleaning. As a result, we end up with only a few rows of positively labeled data.

The data contains sensor readings at regular time-intervals (x’s) of 2 mins and the event label (y). The primary purpose of the data is assumed to be building a classification model for early prediction of the rare event (you can think of any other approach). However, it can also be used for multivariate time series data exploration and building other supervised and unsupervised models.

A multivariate time series (MTS) is produced when multiple interconnected streams of data are recorded over time. They are commonly found in manufacturing processes that have several interconnected sensors collecting the data in overtime. In this problem, we have a similar multivariate time series data from a pulp-and-paper industry with a rare event associated with them. It is an unwanted event in the process — a paper break, in our case — that should be prevented. The objective of the problem is to

  • Predict the event before it occurs, and
  • Identify the variables that are expected to cause the event (in order to be able to prevent it)

A typical paper machine is several meters long that ingests raw materials at one end and produces reels of paper as shown in the picture. Several sensors are placed in different parts of the machine along its length and breadth. These sensors measure both raw materials (e.g. amount of pulp fiber, chemicals, etc.) and process variables (e.g. blade type, couch vacuum, rotor speed, etc.).

We encourage researchers to think of this problem. We have built a model for this data and have achieved an f1-score of 0.1. We had several researchers from top universities in the US and abroad work on this and share their results. At the time of publishing this, we are still at an f1-score of 0.1.

Related Articles

Estimating Non-Linear Correlation in R

Estimating Non-Linear Correlation in R

Correlation estimations are commonly used in various data mining applications. In my experience, nonlinear correlations are quite common in various processes. Due to this, nonlinear models, such as SVM, are employed for regression, classification, etc.

Sequence Embedding for Clustering and Classification

Sequence Embedding for Clustering and Classification

Here we will learn an approach to get vector embeddings for string sequences. These embeddings can be used for Clustering and Classification. Sequence modeling has been a challenge. This is because of the inherent un-structuredness of sequence data. Just like texts in Natural Language Processing (NLP), sequences are arbitrary strings.

Dataset: Rare Event Classification in Multivariate Time Series (Pt. 2)

Dataset: Rare Event Classification in Multivariate Time Series (Pt. 2)

Researchers are often looking for interesting real-world problems. One major roadblock they face is real-world data. Here we are trying to serve the research community by providing a real-world problem and a dataset. This dataset is created from the pulp-and-paper manufacturing industry. Paper sheets are not the strongest material.

Dataset: Rare Event Classification in Multivariate Time Series (Pt. 1)

Dataset: Rare Event Classification in Multivariate Time Series (Pt. 1)
Gear Process Icon

Written by

Case Study

A real-world dataset is provided from the pulp-and-paper manufacturing industry. The dataset comes from a multivariate time series process. The data contains a rare event of paper break that commonly occurs in the industry. The data contains sensor readings at regular time-intervals (x’s) and the event label (y).

The primary purpose of the data is thought to be building a classification model for early prediction of a rare event. However, it can also be used for multivariate time series data exploration and building other supervised and unsupervised models.

Problem

A multivariate time series (MTS) is produced when multiple interconnected streams of data are recorded over time. They are commonly found in manufacturing processes that have several interconnected sensors collecting the data in overtime. In this problem, we have a similar multivariate time series data from a pulp-and-paper industry with a rare event associated with them. It is an unwanted event in the process — a paper break, in our case — that should be prevented.

The objective of the problem is to:

  • Predict the event before it occurs, and
  • Identify the variables that are expected to cause the event (in order to be able to prevent it).

Data

We provide data from a pulp-and-paper mill. An example of a paper manufacturing machine is shown above. These machines are typically several meters long that ingests raw materials at one end and produces reels of paper as shown in the picture.

Several sensors are placed in different parts of the machine along its length and breadth. These sensors measure both raw materials (e.g. amount of pulp fiber, chemicals, etc.) and process variables (e.g. blade type, couch vacuum, rotor speed, etc.).

Paper manufacturing can be viewed as a continuous rolling process. During this process, sometimes the paper breaks. If a break happens, the entire process is stopped, the reel is taken out, any found problem is fixed, and the production is resumed. The resumption can take more than an hour. The cost of this lost production time is significant for a mill. Even a 5\% reduction in the break events will give a significant cost saving for a mill.

The objective of the given problem is to predict such breaks in advance (early prediction) and identify the potential cause(s) to prevent the break. To build such a prediction model, we will use the data collected from the network of sensors in a mill. This is a multivariate time series data with a break as the response (a binary variable).

Related Articles

Estimating Non-Linear Correlation in R

Estimating Non-Linear Correlation in R

Correlation estimations are commonly used in various data mining applications. In my experience, nonlinear correlations are quite common in various processes. Due to this, nonlinear models, such as SVM, are employed for regression, classification, etc.

Sequence Embedding for Clustering and Classification

Sequence Embedding for Clustering and Classification

Here we will learn an approach to get vector embeddings for string sequences. These embeddings can be used for Clustering and Classification. Sequence modeling has been a challenge. This is because of the inherent un-structuredness of sequence data. Just like texts in Natural Language Processing (NLP), sequences are arbitrary strings.

Dataset: Rare Event Classification in Multivariate Time Series (Pt. 2)

Dataset: Rare Event Classification in Multivariate Time Series (Pt. 2)

Researchers are often looking for interesting real-world problems. One major roadblock they face is real-world data. Here we are trying to serve the research community by providing a real-world problem and a dataset. This dataset is created from the pulp-and-paper manufacturing industry. Paper sheets are not the strongest material.