Sequence models commonly used for sentence classification.

Image credits: https://www.pinterest.com/pin/840765824155900880/

Let’s begin with a sentence: “Today we see a beautiful blue ___.” What is the most appropriate word to fill in the blank? “Sky”, isn’t it? We can guess it because the surrounding words give us context. Likewise, when we predict sentiment or classify sentences, we try to capture the context between the words. Let’s see how we can do this.

I have organized the article into the following sections:

  1. Introduction
  2. Implementation: (preprocessing and the model)
  3. Comparison between the Conv1D, GRU, and Bi-LSTM models
  4. Conclusion

I. INTRODUCTION

Sentence classification can be done typically in two ways:

  1. Bag of words model (BOW) involving approaches like TF-IDF, count vectorizer.
  2. Deep neural network models using LSTMs, GRUs, and ConvNets (Convolutional neural networks).

The BOW model treats each word independently and encodes a sentence by which words occur in it and how often. For the BOW approach, we can use TF-IDF or a count vectorizer, but these approaches discard word order, so they don’t preserve the context of each word in the sentence.
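
To make the loss of context concrete, here is a minimal sketch of a count-based bag of words in plain Python. The helper name `bag_of_words` and the two example sentences are made up for illustration; a real pipeline would more likely use a library vectorizer.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count how often each lowercase, whitespace-separated word occurs."""
    return Counter(sentence.lower().split())

# Two sentences with opposite meanings but exactly the same words.
a = bag_of_words("the movie was good not bad")
b = bag_of_words("the movie was bad not good")

# Word order is discarded, so both sentences get the same representation.
print(a == b)  # True
```

A classifier that only sees these counts cannot tell the two reviews apart, which is the limitation that sequence models try to address.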

So, to achieve better performance on tasks like sentence classification, sentiment analysis, and named entity extraction, we often use deep neural networks.

In this article, I have used the IMDB reviews dataset available in TensorFlow Datasets. The dataset consists of 50,000 movie reviews, split into 25,000 for training and 25,000 for testing, each labelled ‘1’ for a positive review or ‘0’ for a negative review.

To train deep neural models, we need word embeddings as the input. An embedding represents each word as a dense vector in a multi-dimensional space. These vectors are learned from the contexts in which the words appear; when the embedding layer is trained together with the classifier, the meaning of the words also comes from the labelling of the dataset. We can use pre-trained embeddings like GloVe or fastText, which are trained on huge corpora, or we can create our own embeddings (trained on our corpus) using packages such as Gensim or spaCy, or the TensorFlow and Keras APIs.
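
As a rough illustration of what an embedding layer does, the sketch below builds a small lookup table in plain Python: one row of numbers per word index. The random initial values and the toy sizes are assumptions for this example; in a real model these values are learned during training.

```python
import random

random.seed(0)  # fixed seed so the example is repeatable

toy_vocab_size = 5      # hypothetical tiny vocabulary
toy_embedding_dim = 4   # hypothetical vector size

# One row per word index; each row is that word's vector.
embedding_matrix = [
    [random.uniform(-0.05, 0.05) for _ in range(toy_embedding_dim)]
    for _ in range(toy_vocab_size)
]

def embed(sequence):
    """Replace every word index in the sequence by its vector."""
    return [embedding_matrix[i] for i in sequence]

encoded_sentence = [0, 0, 3, 1, 4]  # a padded sentence of word indices
vectors = embed(encoded_sentence)

# The result has one vector per token: (sequence length, embedding size).
print(len(vectors), len(vectors[0]))  # 5 4
```

With the settings used later in this article, the same lookup would turn each review of 120 word indices into 120 vectors of 16 numbers.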

Preprocessing the data:

The data is loaded and stored in two lists: training_sentences and testing_sentences. The vocab_size is the maximum number of words the tokenizer keeps, ranked by frequency; the word_index itself still contains every unique word in the corpus, so it is usually longer than vocab_size. The embedding_dim is the number of dimensions in which each word in a movie review is represented. The max_length is the length to which every encoded review is padded or truncated. Out-of-vocabulary words, which are typically encountered in the testing data, are replaced by the “<OOV>” token.

In this step, we can also load the GloVe word-embedding weight matrix and use those weights directly, without training the embedding layer again.

# Hyperparameters for tokenization and the embedding layer.
vocab_size = 10000      # keep only the most frequent words
embedding_dim = 16      # size of each word vector
max_length = 120        # length of every padded sequence
trunc_type = 'post'     # cut long reviews at the end
oov_tok = "<OOV>"       # placeholder for out-of-vocabulary words

# Tokenizing and padding the sentences into equal-length sequences.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(sequences, maxlen=max_length, truncating=trunc_type)

# The testing data reuses the tokenizer fitted on the training data.
# No truncating argument is passed here, so the default ('pre') applies.
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length)

Now, ‘padded’ is a two-dimensional NumPy array in which each row is one encoded sentence. Each encoded sentence is padded with zeros at the front (‘pre’ padding is the default) so that all the sentences have the same length. Similarly, the testing sentences are tokenized and pre-padded with zeros. The word_index is a dictionary containing all the unique words in the corpus: the key is the word, and the value is its integer code.
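
The padding and truncating behaviour described above can be sketched in plain Python. The function `pad_sequence` below is a hypothetical, simplified stand-in for `pad_sequences` that handles a single list; it assumes padding at the front with zeros and supports both truncating modes.

```python
def pad_sequence(seq, maxlen, truncating='pre'):
    """Pad with leading zeros or truncate so the result has length maxlen."""
    if len(seq) > maxlen:
        # 'pre' drops tokens from the start, 'post' drops them from the end.
        seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
    return [0] * (maxlen - len(seq)) + list(seq)

print(pad_sequence([7, 8, 9], maxlen=5))                              # [0, 0, 7, 8, 9]
print(pad_sequence([1, 2, 3, 4, 5, 6], maxlen=4, truncating='post'))  # [1, 2, 3, 4]
print(pad_sequence([1, 2, 3, 4, 5, 6], maxlen=4, truncating='pre'))   # [3, 4, 5, 6]
```

The last two calls show why the truncating mode matters for long reviews: 'post' keeps the beginning of the review, while 'pre' keeps the end.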

Let’s look at some of the entries in the word_index dictionary.

dict(list(word_index.items())[0:10])
{'<OOV>': 1, 'a': 4, 'and': 3, 'br': 8, 'in': 9, 'is': 7, 'it': 10, 'of': 5, 'the': 2, 'to': 6}
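
How such a dictionary can be built and used is sketched below in plain Python. This is a simplified approximation of the tokenizer, not its actual implementation: the helper names are hypothetical, punctuation handling is omitted, index 1 is reserved for the OOV token, and the remaining words are numbered by decreasing frequency.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Number words by decreasing frequency, reserving index 1 for OOV."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def encode(text, word_index, num_words, oov_token="<OOV>"):
    """Map words to indices; unknown or too-rare words become the OOV index."""
    oov_index = word_index[oov_token]
    encoded = []
    for word in text.lower().split():
        index = word_index.get(word)
        if index is None or index >= num_words:
            index = oov_index
        encoded.append(index)
    return encoded

toy_index = build_word_index(["the film was great", "the plot was thin", "the end"])
print(toy_index["the"])                                        # 2, the most frequent word
print(encode("the film was boring", toy_index, num_words=4))   # [2, 1, 3, 1]
```

In the toy example, "film" is in the dictionary but falls outside the `num_words` limit, and "boring" was never seen, so both are encoded as 1.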

Let’s look at one of the preprocessed sentences.

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    # Index 0 is the padding value and has no word, so it is shown as '?'.
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

print(decode_review(padded[2]))
#print(training_sentences[2])
? ? ? ? ? ? ? b'i have been known to fall asleep during films but this is usually due to a combination of things including really tired being warm and comfortable on the <OOV>

Defining the Bi-directional LSTM:

# Model definition with a bidirectional LSTM
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

The model summary is as follows:

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 120, 16)           160000
_________________________________________________________________
bidirectional_1 (Bidirection (None, 64)                12544
_________________________________________________________________
dense_2 (Dense)              (None, 6)                 390
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 7
=================================================================
Total params: 172,941
Trainable params: 172,941
Non-trainable params: 0
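
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the variable names are only for illustration.

```python
vocab_size = 10000
embedding_dim = 16
lstm_units = 32

# Embedding: one vector of size embedding_dim per word index.
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (embedding_dim + lstm_units + 1) * lstm_units
# Bidirectional: one LSTM per direction; outputs are concatenated (2 * 32 = 64).
bidirectional_params = 2 * lstm_params

# Dense: (inputs + 1 bias) * outputs.
dense_6_params = (2 * lstm_units + 1) * 6
dense_1_params = (6 + 1) * 1

total = embedding_params + bidirectional_params + dense_6_params + dense_1_params
print(embedding_params, bidirectional_params, dense_6_params, dense_1_params, total)
# 160000 12544 390 7 172941
```

Most of the parameters sit in the embedding layer, which is why vocab_size and embedding_dim dominate the size of this model.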

After 50 epochs, the accuracy I got using the bidirectional LSTM is:

num_epochs = 50
history = model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))
Epoch 49/50 25000/25000 [==============================] - 392s 16ms/sample - loss: 0.0014 - acc: 0.9996 - val_loss: 1.7093 - val_acc: 0.8162
Epoch 50/50 25000/25000 [==============================] - 406s 16ms/sample - loss: 8.9091e-05 - acc: 1.0000 - val_loss: 1.7495 - val_acc: 0.8153

Now, let’s see how the training accuracy and validation accuracy change using GRUs (Gated Recurrent Units). I replaced the LSTM layer with a bidirectional GRU layer of 32 units, and the accuracy is:

Epoch 49/50 25000/25000 [==============================] - 343s 14ms/sample - loss: 3.9704e-08 - acc: 1.0000 - val_loss: 2.5788 - val_acc: 0.8144 
Epoch 50/50 25000/25000 [==============================] - 342s 14ms/sample - loss: 2.6915e-08 - acc: 1.0000 - val_loss: 2.6379 - val_acc: 0.8143
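
For readers unfamiliar with GRUs, here is a minimal plain-Python sketch of a single GRU step and of how a bidirectional layer combines two passes. It is a toy under stated assumptions, not the Keras implementation: the input and hidden state are scalars (one unit instead of 32), the weights in `params` are made up, and the gate equations follow one common formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, p):
    """One GRU step for a scalar input x and a scalar hidden state h."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])   # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])   # reset gate
    h_candidate = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])
    # Blend the old state and the candidate according to the update gate.
    return z * h + (1.0 - z) * h_candidate

def run_gru(inputs, p):
    """Run the GRU over a sequence and return the final hidden state."""
    h = 0.0
    for x in inputs:
        h = gru_step(x, h, p)
    return h

# Hypothetical toy weights; a trained layer learns these values.
params = {"wz": 0.5, "uz": 0.5, "bz": 0.0,
          "wr": 0.5, "ur": 0.5, "br": 0.0,
          "wh": 1.0, "uh": 1.0, "bh": 0.0}

sequence = [0.1, -0.3, 0.8]
forward_state = run_gru(sequence, params)
backward_state = run_gru(sequence[::-1], params)

# A bidirectional layer concatenates the final states of both directions.
print([forward_state, backward_state])
```

A GRU has two gates and no separate cell state, so it has fewer parameters per unit than an LSTM, which is consistent with the slightly shorter epoch times in the logs above.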

Similarly, I have also tested a one-dimensional convolutional layer with 128 filters of length 5. The accuracy is:

Epoch 49/50 25000/25000 [==============================] - 8s 305us/sample - loss: 8.4205e-06 - acc: 1.0000 - val_loss: 2.8729 - val_acc: 0.7946
Epoch 50/50 25000/25000 [==============================] - 8s 305us/sample - loss: 5.2563e-06 - acc: 1.0000 - val_loss: 2.9401 - val_acc: 0.7948

We can see that the 1D convolutional network is much faster than the LSTM and the GRU (about 8 seconds per epoch versus roughly 340 to 400 seconds), while its validation accuracy is slightly lower (about 0.79 versus about 0.81).
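
To illustrate what a one-dimensional convolution over a sentence does, here is a minimal plain-Python sketch for a single filter on a sequence of scalar values. It assumes stride 1 and no padding at the edges (the Keras `Conv1D` defaults); the real layer slides 128 such filters over 16-dimensional embedding vectors. Each window is computed independently of the others, which is one reason convolutions parallelize better than recurrent layers, where each step has to wait for the previous one.

```python
def conv1d_valid(sequence, kernel, bias=0.0):
    """Slide the kernel over the sequence with stride 1 and no padding."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, sequence[start:start + k])) + bias
        for start in range(len(sequence) - k + 1)
    ]

max_length = 120
filter_length = 5

sequence = [float(i) for i in range(max_length)]   # toy input values
kernel = [1.0 / filter_length] * filter_length     # toy averaging filter

features = conv1d_valid(sequence, kernel)
print(len(features))   # 116, i.e. max_length - filter_length + 1

# Weights of a layer with 128 such filters over 16-dimensional embeddings:
conv_params = filter_length * 16 * 128 + 128
print(conv_params)     # 10368
```

Each filter only sees five consecutive words at a time, so it captures local phrases rather than long-range context.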

Now let’s look at the learning curves of all three sequence models with respect to accuracy and loss.

In the above figure, we can see that the training accuracy reached almost 1.0 for all three models, while the validation accuracy was best with the LSTM. However, the models clearly overfit the training data: the training loss is close to zero while the validation loss is high and still rising in the last epochs, and this holds the validation accuracy down. Tweaking the hyperparameters or stacking multiple LSTM layers might smooth out the curves. Using embedding weights pre-trained on a vast corpus may also be a better option and improve performance.

Below are the plots for training loss and validation loss of the three models.

From the above graph, we see that the LSTM performed best: its validation loss by the end of the 50th epoch is around 1.75. The 1D convolutional network was the least successful, with the highest validation loss (around 2.94). The GRU model achieved good validation accuracy, but its validation loss (around 2.64) is noticeably higher than the LSTM’s.

Conclusion:

So neural models that preserve the order of the words and their context tend to perform better not only in tasks like sentence classification but also in generating text. However, the appropriate model depends on the context and the application, so in some cases a simple model may be preferable to a complex neural model.

Hope you enjoyed reading. Please feel free to provide feedback. Thank you!

