Sequence Embedding for Clustering and Classification


Here we will learn an approach to get vector embeddings for string sequences. These embeddings can be used for Clustering and Classification.

Sequence modeling has long been a challenge because of the inherently unstructured nature of sequence data. Just like texts in Natural Language Processing (NLP), sequences are arbitrary strings. For a computer, these strings have no inherent meaning, which makes building a data mining model difficult.

For texts, we have embeddings, such as word2vec, that convert a word into an n-dimensional vector, essentially placing it in a Euclidean space. In this post, we will learn to do the same for sequences.

Here we will go over an approach that creates embeddings for sequences, bringing each sequence into a Euclidean space. With these embeddings, we can perform conventional Machine Learning and Deep Learning, e.g. k-means, PCA, and a Multi-Layer Perceptron, on sequence datasets. We provide and work on two datasets: protein sequences and weblogs.

Sequence datasets are commonly found around us: for example, clickstreams, music listening histories, and weblogs in the tech industry. In bioinformatics, we have large databases of protein sequences. A protein sequence is made of some combination of 20 amino acids; a typical protein sequence is a string such as 'MEIEKTNRMNALFEFYAALLTDKQ…', where each letter corresponds to an amino acid.

A protein sequence does not necessarily contain all 20 amino acids, only some subset of them.

For clarity, we will define some keywords used in this post.

  • alphabet: a discrete element that makes up a sequence, e.g. an amino acid.
  • alphabet-set: the set of all alphabets that make up the sequences in a corpus, e.g. all protein sequences in a corpus are made from a set of 20 amino acids.
  • sequence: an ordered series of discrete alphabets. A sequence in a corpus contains a subset of the alphabet-set.

A sequence corpus typically contains thousands to millions of sequences. Depending on whether the data is labeled or unlabeled, we often need classification or clustering. However, doing this is not straightforward due to the unstructured nature of sequences: they are arbitrary strings of arbitrary length.

To overcome this, sequence embeddings can be used. Here we will use the SGT (Sequence Graph Transform) embedding, which embeds the long- and short-term patterns in a sequence into a finite-dimensional vector. The advantage of SGT embedding is that we can easily tune the amount of long- and short-term patterns it captures without increasing the computation.
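To build some intuition before using the package, here is a deliberately simplified, illustrative sketch of the SGT idea (not the exact formulation; refer to the SGT paper for the precise definition): for every ordered pair of alphabets (u, v), accumulate exp(-kappa * gap) over all positions where u precedes v, and normalize. A small kappa lets far-apart co-occurrences contribute (long-term patterns), while a large kappa emphasizes nearby co-occurrences (short-term patterns).

import itertools
import numpy as np

def toy_sgt(seq, alphabet_set, kappa=1.0):
    # Illustrative simplification of SGT: one feature per ordered
    # alphabet pair (u, v), aggregating exp(-kappa * gap) over all
    # positions where u appears before v, normalized by the count of u.
    positions = {a: [i for i, s in enumerate(seq) if s == a] for a in alphabet_set}
    features = {}
    for u, v in itertools.product(alphabet_set, repeat=2):
        vals = [np.exp(-kappa * (m - l))
                for l in positions[u] for m in positions[v] if m > l]
        features[(u, v)] = sum(vals) / len(positions[u]) if positions[u] else 0.0
    return features

print(toy_sgt(list('BABAB'), alphabet_set=['A', 'B'], kappa=1.0)[('A', 'B')])

With 20 amino acids, such pairwise features give a 20 x 20 = 400-dimensional vector per sequence, which is what the sgt package produces below.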

The source code and data used in the following are publicly available. Before moving forward, we will need to install the sgt package:

$ pip install sgt

Clustering

Protein Sequence Clustering

The data used here is taken from a public database of proteins. The data contains the protein sequences and their functions. In this section, we will cluster the protein sequences; in the next, we will use their functions as labels for building a classifier.

We first read the sequence data, and convert it into a list of lists. As shown below, each sequence is a list of alphabets.

>>> import pandas as pd
>>> protein_data = pd.read_csv('../data/protein_classification.csv', index_col=0)
>>> X = protein_data['Sequence']
>>> def split(word):
>>>     return [char for char in word]
>>> sequences = [split(x) for x in X]
>>> print(sequences[0])
['M', 'E', 'I', 'E', 'K', 'T', 'N', 'R', 'M', 'N', 'A', 'L', 'F', 'E', 'F', 'Y', 'A', 'A', 'L', 'L', 'T', 'D', 'K', 'Q', 'M', 'N', 'Y', 'I', 'E', 'L', 'Y', 'Y', 'A', 'D', 'D', 'Y', 'S', 'L', 'A', 'E', 'I', 'A', 'E', 'E', 'F', 'G', 'V', 'S', 'R', 'Q', 'A', 'V', 'Y', 'D', 'N', 'I', 'K', 'R', 'T', 'E', 'K', 'I', 'L', 'E', 'D', 'Y', 'E', 'M', 'K', 'L', 'H', 'M', 'Y', 'S', 'D', 'Y', 'I', 'V', 'R', 'S', 'Q', 'I', 'F', 'D', 'Q', 'I', 'L', 'E', 'R', 'Y', 'P', 'K', 'D', 'D', 'F', 'L', 'Q', 'E', 'Q', 'I', 'E', 'I', 'L', 'T', 'S', 'I', 'D', 'N', 'R', 'E']

Next, we generate the sequence embeddings.

>>> from sgt import Sgt
>>> sgt = Sgt(kappa = 10, lengthsensitive = False)
>>> embedding = sgt.fit_transform(corpus=sequences)

The embedding is in a 400-dimensional space: one feature per ordered pair of the 20 amino acids, i.e. 20 x 20 = 400 dimensions. Let's first do PCA on it and reduce the dimension to two. This will also help visualize the clusters.

>>> from sklearn.decomposition import PCA
>>> import numpy as np
>>> pca = PCA(n_components=2)
>>> pca.fit(embedding)
>>> X = pca.transform(embedding)
>>> print(np.sum(pca.explained_variance_ratio_))
0.6019403543806409

The top two PCs explain about 60% of the variance. We will now cluster the embeddings into three clusters.

>>> from sklearn.cluster import KMeans
>>> import matplotlib.pyplot as plt
>>> df = pd.DataFrame(data=X, columns=['x1', 'x2'])
>>> kmeans = KMeans(n_clusters=3, max_iter=300)
>>> kmeans.fit(df)
>>> labels = kmeans.predict(df)
>>> centroids = kmeans.cluster_centers_
>>> fig = plt.figure(figsize=(5, 5))
>>> colmap = {1: 'r', 2: 'g', 3: 'b'}
>>> colors = list(map(lambda x: colmap[x+1], labels))
>>> plt.scatter(df['x1'], df['x2'], color=colors, alpha=0.5, edgecolor=colors)
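As a quick sanity check on the choice of three clusters (an addition for illustration, not part of the original analysis), we can compute a silhouette score on the 2-D projection:

from sklearn.metrics import silhouette_score

# Values closer to 1 indicate better-separated clusters.
print(silhouette_score(df, labels))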

Moving on to building classifiers.

Classification

Protein Sequence Classification

We will start by building a classifier on the same protein dataset we used earlier. The proteins in the dataset have two functions. Therefore, we will build a binary classifier.

We will first convert the function [CC] column in the data into labels that can be ingested by an MLP model built in keras.
>>> from sklearn.preprocessing import LabelEncoder
>>> y = protein_data['Function [CC]']
>>> encoder = LabelEncoder()
>>> encoder.fit(y)
>>> encoded_y = encoder.transform(y)
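As a quick check (not in the original post), encoder.classes_ lists the two function labels that were mapped to 0 and 1:

# The two protein functions, in the order they were encoded.
print(list(encoder.classes_))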

In the following, we build the MLP classifier and run 10-fold cross-validation.

>>> import sklearn.metrics
>>> from sklearn.model_selection import KFold
>>> from keras.models import Sequential
>>> from keras.layers import Dense, Dropout, Activation
>>> kfold = 10
>>> X = pd.DataFrame(embedding)
>>> y = encoded_y
>>> random_state = 1
>>> test_F1 = np.zeros(kfold)
>>> skf = KFold(n_splits=kfold, shuffle=True, random_state=random_state)
>>> k = 0
>>> epochs = 50
>>> batch_size = 128
>>> for train_index, test_index in skf.split(X, y):
>>>     X_train, X_test = X.iloc[train_index], X.iloc[test_index]
>>>     y_train, y_test = y[train_index], y[test_index]
>>>     X_train = X_train.values  # as_matrix() is deprecated in newer pandas
>>>     X_test = X_test.values
>>>     
>>>     model = Sequential()
>>>     model.add(Dense(64, input_shape=(X_train.shape[1],), kernel_initializer='uniform'))
>>>     model.add(Activation('relu'))
>>>     model.add(Dropout(0.5))
>>>     model.add(Dense(32, kernel_initializer='uniform'))
>>>     model.add(Activation('relu'))
>>>     model.add(Dropout(0.5))
>>>     model.add(Dense(1, kernel_initializer='uniform'))
>>>     model.add(Activation('sigmoid'))
>>>     model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
>>>     
>>>     model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, verbose=0)
>>>     
>>>     y_pred = model.predict(X_test).round().astype(int)
>>>     test_F1[k] = sklearn.metrics.f1_score(y_test, y_pred)
>>>     k += 1

>>> print('Average test f1-score', np.mean(test_F1))
Average test f1-score 1.0

This dataset turns out to be too easy for the classifier; the classes are perfectly separable in the embedding space. We have another, more challenging dataset. Let's go over it.

Network Log Data Classification

This data sample is a network intrusion dataset containing audit logs, with any attack as a positive label. Since network intrusion is a rare event, the data is unbalanced. Additionally, it is a small dataset of only 111 records. Here we will build a sequence classification model to predict a network intrusion.

Each sequence in the data is a series of activities, for example,

{login, password, …}

The alphabets in the input data sequences are already encoded as integers. The original sequence data file is available alongside the source code.

Similar to before, we will first prepare the data for a classifier.

>>> darpa_data = pd.read_csv('../data/darpa_data.csv', index_col=0)
>>> X = darpa_data['seq']
>>> sequences = [x.split('~') for x in X]
>>> y = darpa_data['class']
>>> encoder = LabelEncoder()
>>> encoder.fit(y)
>>> y = encoder.transform(y)

In this data, the sequence embeddings should be length-sensitive. Lengths matter here because sequences with similar patterns but different lengths can have different labels. Consider a simple example of two sessions:

{login, pswd, login, pswd,…}
{login, pswd,…(repeated several times)…, login, pswd}

While the first session could be a regular user mistyping the password once, the second is possibly an attack to guess the password. Thus, the sequence lengths are as important as the patterns.

>>> sgt_darpa = Sgt(kappa = 5, lengthsensitive = True)
>>> embedding = sgt_darpa.fit_transform(corpus=sequences)
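To see the effect of the length-sensitive setting, here is a hypothetical check (not in the original post): embed a short session and a long session built from the same repeated pattern and compare the rows; with lengthsensitive=True they are expected to differ.

# Hypothetical illustration: identical pattern, different lengths.
short_session = ['login', 'pswd'] * 2
long_session = ['login', 'pswd'] * 20
toy_embedding = Sgt(kappa=5, lengthsensitive=True).fit_transform(
    corpus=[short_session, long_session])
# With a length-insensitive embedding the two rows would be nearly identical.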

The embedding we get here is sparse. Therefore, we will perform dimension reduction using PCA before training a classifier.

>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=35)
>>> pca.fit(embedding)
>>> X = pca.transform(embedding)
>>> print(np.sum(pca.explained_variance_ratio_))
0.9862350164327149

The top 35 PCs explain more than 98% of the variance. We will now go ahead and build a Multi-Layer Perceptron using keras. Since the dataset is small, and the number of positively labeled points even smaller, we will perform 3-fold cross-validation.

>>> import time
>>> from sklearn.model_selection import StratifiedKFold
>>> kfold = 3
>>> random_state = 11
>>> test_F1 = np.zeros(kfold)
>>> time_k = np.zeros(kfold)
>>> skf = StratifiedKFold(n_splits=kfold, shuffle=True, random_state=random_state)
>>> k = 0
>>> epochs = 300
>>> batch_size = 15
>>> # The weights can be made inversely proportional to the class sizes to improve accuracy.
>>> class_weight = {0: 0.12, 1: 0.88}
>>> for train_index, test_index in skf.split(X, y):
>>>     X_train, X_test = X[train_index], X[test_index]
>>>     y_train, y_test = y[train_index], y[test_index]
>>>     
>>>     model = Sequential()
>>>     model.add(Dense(128, input_shape=(X_train.shape[1],), kernel_initializer='uniform'))
>>>     model.add(Activation('relu'))
>>>     model.add(Dropout(0.5))
>>>     model.add(Dense(1, kernel_initializer='uniform'))
>>>     model.add(Activation('sigmoid'))
>>>     model.summary()
>>>     model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
>>>    
>>>     start_time = time.time()
>>>     model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, class_weight=class_weight)
>>>     end_time = time.time()
>>>     time_k[k] = end_time - start_time
>>>     y_pred = model.predict(X_test).round().astype(int)
>>>     test_F1[k] = sklearn.metrics.f1_score(y_test, y_pred)
>>>     k += 1
>>> print('Average Test f1-score', np.mean(test_F1))
Average Test f1-score 0.5236467236467236
>>> print('Average Run time', np.mean(time_k))
Average Run time 9.076935768127441

This was a difficult dataset to classify. To have a loose benchmark, let's build a fancier LSTM classifier on the same data.

>>> from keras.layers import Embedding, LSTM
>>> from keras.preprocessing import sequence
>>> X = darpa_data['seq']
>>> encoded_X = np.ndarray(shape=(len(X),), dtype=list)
>>> for i in range(0, len(X)):
>>>     encoded_X[i] = X.iloc[i].split("~")
>>> max_seq_length = np.max(darpa_data['seqlen'])
>>> encoded_X = sequence.pad_sequences(encoded_X, maxlen=max_seq_length)
>>> kfold = 3
>>> random_state = 11
>>> test_F1 = np.zeros(kfold)
>>> time_k = np.zeros(kfold)
>>> epochs = 50
>>> batch_size = 15
>>> skf = StratifiedKFold(n_splits=kfold, shuffle=True, random_state=random_state)
>>> k = 0
>>> for train_index, test_index in skf.split(encoded_X, y):
>>>     X_train, X_test = encoded_X[train_index], encoded_X[test_index]
>>>     y_train, y_test = y[train_index], y[test_index]
>>> 
>>>     embedding_vector_length = 32
>>>     top_words = 50
>>>     model = Sequential()
>>>     model.add(Embedding(top_words, embedding_vector_length, input_length=max_seq_length))
>>>     model.add(LSTM(32))
>>>     model.add(Dense(1, kernel_initializer='uniform'))
>>>     model.add(Activation('sigmoid'))
>>>     model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
>>>     start_time = time.time()
>>>     model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, verbose=1)
>>>     end_time = time.time()
>>>     time_k[k] = end_time - start_time
>>>     y_pred = model.predict(X_test).round().astype(int)
>>>     test_F1[k] = sklearn.metrics.f1_score(y_test, y_pred)
>>>     k += 1
>>> print('Average Test f1-score', np.mean(test_F1))
Average Test f1-score 0.0
>>> print('Average Run time', np.mean(time_k))
Average Run time 425.68603706359863

We find that the LSTM classifier gives an F1 score of 0. This may be improved by changing the model. However, it shows that the SGT embedding can work with small and unbalanced data without needing a complicated classifier. The SGT-embedded network was also significantly faster at runtime: it took 9.1 seconds on average, while the LSTM model took 425.6 seconds.

Concluding Remarks

  • We learned about using a sequence embedding for sequence clustering and classification.
  • This embedding is an implementation of the Sequence Graph Transform (SGT) paper. Accuracy comparisons of SGT embedding against other approaches are not covered in this post; refer to the paper for them.
  • Due to SGT embedding’s ability to capture long- and short- term patterns, it works better than most other sequence modeling approaches.
  • It is recommended that you play with the tuning parameter, kappa, to see its effect on accuracy; a sketch of such a sweep follows this list.
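As an illustration (a hypothetical sketch, not from the original post), a kappa sweep could look like the following. It assumes protein_sequences (the list of alphabet lists) and encoded_y (the binary labels) from the protein classification example above, and stands in a simple logistic regression for the MLP to keep the sweep fast.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sgt import Sgt

# Hypothetical kappa sweep: recompute the embedding for each kappa
# and score a simple classifier on it.
for kappa in [1, 5, 10, 20]:
    emb = Sgt(kappa=kappa, lengthsensitive=False).fit_transform(corpus=protein_sequences)
    scores = cross_val_score(LogisticRegression(max_iter=1000), emb, encoded_y,
                             cv=5, scoring='f1')
    print('kappa =', kappa, 'mean F1 =', scores.mean())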


Dataset: Rare Event Classification in Multivariate Time Series (Pt. 2)

Researchers are often looking for interesting real-world problems. One major roadblock they face is real-world data. Here we are trying to serve the research community by providing a real-world problem and a dataset.

This dataset is created from the pulp-and-paper manufacturing industry. Paper manufacturing is a continuous process. A paper machine runs round-the-clock to continuously roll out reels of paper.

However, as smooth as we would like this process to be, we face paper breaks almost every day. Paper sheets are not the strongest material. Due to some adverse changes in process conditions, the paper sheet sometimes breaks (tears).

Whenever a break occurs, the whole machine is stopped, and it typically takes more than an hour to restore it. During this downtime, the mill loses more than $10k. Worse, an operator is often required to enter hazardous areas to inspect and restore the machine.

In the paper mills we worked with, at least one break occurs every day on average. This is a major problem for these mills, causing yearly losses in the order of millions of dollars as well as work hazards. Even a 5% reduction in breaks would bring significant benefits to the mills.

We collected this break data with the purpose of building a model that can predict a break in advance and thereby help mill operators prevent it. The key is to predict in advance with few false positives. In the following, we explain the data and the underlying challenges.

The dataset comes from a multivariate time series process. As mentioned before, the data contains a rare event of paper break that commonly occurs in the industry. Although a break happens every day, we still call it a rare event because break rows make up only a tiny fraction of all observations.

We have high-frequency data measured every two minutes, so one day yields 720 rows. If a break happens, although the data will have approximately one hour of consecutive rows labeled as a break, we drop all of those rows except the first one.

For example, if a break happened at time t and lasted until time t+k, we drop the rows for times (t+1):(t+k). This is part of data cleaning. As a result, we end up with only a few rows of positively labeled data.
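In pandas, this cleaning step could look like the following sketch (the column name y and the file name are assumptions for illustration):

import pandas as pd

# Hypothetical sketch of the cleaning step: keep only the first row of
# each consecutive run of break labels (y == 1) and drop the rest.
df = pd.read_csv('processminer_data.csv')  # hypothetical file name
is_break = df['y'] == 1
first_of_run = is_break & ~is_break.shift(fill_value=False)
df_clean = df[~is_break | first_of_run].reset_index(drop=True)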

The data contains sensor readings at regular time-intervals (x's) of 2 minutes and the event label (y). The primary purpose of the data is assumed to be building a classification model for early prediction of the rare event (though other approaches are welcome). It can also be used for multivariate time series data exploration and for building other supervised and unsupervised models.

A multivariate time series (MTS) is produced when multiple interconnected streams of data are recorded over time. They are commonly found in manufacturing processes, where several interconnected sensors collect data over time. In this problem, we have similar multivariate time series data from a pulp-and-paper industry with a rare event associated with it. It is an unwanted event in the process, a paper break in our case, that should be prevented. The objective of the problem is to

  • Predict the event before it occurs (a label-shifting sketch follows this list), and
  • Identify the variables that are expected to cause the event (in order to be able to prevent it)
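One common way to set up the early-prediction target (an illustrative assumption, not prescribed by the dataset) is to shift the labels earlier in time, so that the rows preceding a break become positive examples. Reusing the hypothetical df_clean from the cleaning sketch above:

import numpy as np

# Mark the n rows preceding a break as positive, so a classifier learns
# to warn n * 2 minutes in advance (rows are 2 minutes apart).
n = 2
y = df_clean['y'].to_numpy()
y_early = y.copy()
for i in range(1, n + 1):
    y_early[:-i] = np.maximum(y_early[:-i], y[i:])
df_clean['y_early'] = y_early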

A typical paper machine is several meters long; it ingests raw materials at one end and produces reels of paper at the other. Several sensors are placed in different parts of the machine along its length and breadth. These sensors measure both raw materials (e.g. amount of pulp fiber, chemicals, etc.) and process variables (e.g. blade type, couch vacuum, rotor speed, etc.).

We encourage researchers to think about this problem. We have built a model for this data and achieved an f1-score of 0.1. We have had several researchers from top universities in the US and abroad work on this and share their results. At the time of publishing this, we are still at an f1-score of 0.1.


Dataset: Rare Event Classification in Multivariate Time Series (Pt. 1)

A real-world dataset is provided from the pulp-and-paper manufacturing industry. The dataset comes from a multivariate time series process. The data contains a rare event of paper break that commonly occurs in the industry. The data contains sensor readings at regular time-intervals (x’s) and the event label (y).

The primary purpose of the data is thought to be building a classification model for early prediction of a rare event. However, it can also be used for multivariate time series data exploration and building other supervised and unsupervised models.

Problem

A multivariate time series (MTS) is produced when multiple interconnected streams of data are recorded over time. They are commonly found in manufacturing processes, where several interconnected sensors collect data over time. In this problem, we have similar multivariate time series data from a pulp-and-paper industry with a rare event associated with it. It is an unwanted event in the process, a paper break in our case, that should be prevented.

The objective of the problem is to:

  • Predict the event before it occurs, and
  • Identify the variables that are expected to cause the event (in order to be able to prevent it).

Data

We provide data from a pulp-and-paper mill. These machines are typically several meters long, ingesting raw materials at one end and producing reels of paper at the other.

Several sensors are placed in different parts of the machine along its length and breadth. These sensors measure both raw materials (e.g. amount of pulp fiber, chemicals, etc.) and process variables (e.g. blade type, couch vacuum, rotor speed, etc.).

Paper manufacturing can be viewed as a continuous rolling process. During this process, the paper sometimes breaks. If a break happens, the entire process is stopped, the reel is taken out, any identified problem is fixed, and production is resumed. Resumption can take more than an hour. The cost of this lost production time is significant for a mill; even a 5% reduction in break events would give a significant cost saving.

The objective of the given problem is to predict such breaks in advance (early prediction) and identify the potential cause(s) to prevent the break. To build such a prediction model, we will use the data collected from the network of sensors in a mill. This is a multivariate time series data with a break as the response (a binary variable).
