
Multi-categorical text classification with LSTM

I created a prototype of a web application for customer service that uses sequence classification with Keras. The prototype's purpose is to reply to customers with a response appropriate to the category of the question they sent us. Each question relates to one of several categories, and the application predicts which category a question belongs to.

If you are facing a similar situation, this sample might be helpful for you.

You can see the whole source code on GitHub.

Collect text data

Before creating a classification model, we need to collect a data set. Many classification articles on the internet use the IMDB movie review data set. Instead, I use customer service questions and their categories from our product, which I collected and stored as a TSV file.

File format

The format is TSV, and each row consists of an id, a question, an answer, and the question's category, like this:

id question answer category
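For illustration, a made-up row might look like this (tab-separated; this example is invented, not taken from the real data set):

102	アカウントを削除したいのですが。	かしこまりました。手順をご案内します。	Account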

The raw data set has about 9,000 samples, but some of them are unusable, and the questions span about 15 categories.

Load data

Load the data from the TSV-formatted file:

import csv

issues = []

with open("data/issues.tsv", 'r', encoding="utf-8") as f:
    reader = csv.reader(f, delimiter='\t')

    for row in reader:
        # keep the question, answer, and category columns (skip the id)
        issue = []
        issue.append(row[1])  # question
        issue.append(row[2])  # answer
        issue.append(row[3])  # category

        issues.append(issue)

Pre-process text

Remove unnecessary characters

These samples are too rough for learning: some have no question text, and others contain e-mail addresses or symbols such as runs of hyphens. So we have to remove these unnecessary characters.

I removed them with regular expressions, and dropped any question that ends up as an empty string, like this:

import re

filtered_text = []
text = ["長らくお時間を頂戴しております。version: 1.2.3 ----------------------------------------"]

for t in text:
    result = re.compile(r'-+').sub('', t)
    result = re.compile(r'[0-9]+').sub('0', result)
    result = re.compile(r'\s+').sub('', result)
    # ... and many more regular expression substitutions

    # keep only questions that are not empty after filtering
    if len(result) > 0:
        filtered_text.append(result)

    print("text:%s" % result)
    # text:長らくお時間を頂戴しております。
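The snippet above elides most of the substitutions. For example, removing e-mail addresses could be done like this (a sketch; the pattern is my assumption, not the one used in the actual code):

import re

# a simple e-mail pattern; an assumption for illustration only
email_re = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')
print(email_re.sub('', 'お問い合わせは support@example.com までお願いします'))
# お問い合わせは  までお願いします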

Create samples and labels

Create samples and labels from the data set, which has about 15 label categories. I selected two categories as their own classes: 'Account' as label 2 and 'Payment' as label 3. All the remaining categories are grouped together as label 1. Each label should have roughly the same number of samples, because LSTM learning wouldn't work well on an imbalanced data set. In this case I capped each label at 700 samples, because the Payment label has only 688 samples.


labels = []
samples = []
threshold = 700
cnt1 = 0  # all other categories
cnt2 = 0  # Account
cnt3 = 0  # Payment

for row in filtered_samples:
    if 'Account' in row[2]:
        if cnt2 < threshold:
            cnt2 += 1
            labels.append(2)
            samples.append(row[0])
    elif 'Payment' in row[2]:
        if cnt3 < threshold:
            cnt3 += 1
            labels.append(3)
            samples.append(row[0])
    else:
        if cnt1 < threshold:
            cnt1 += 1
            labels.append(1)
            samples.append(row[0])

filtered_samples is the data set after the symbols, e-mail addresses, and so on have been removed as described above.
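To confirm the labels are roughly balanced after capping, a quick sanity check helps (the printed counts here are illustrative):

from collections import Counter

# count how many samples ended up in each label
print(Counter(labels))
# e.g. Counter({1: 700, 2: 700, 3: 688})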

Separate words with MeCab

The questions in the samples are written in Japanese, so we have to separate them into space-delimited words. Below is a question text in Japanese:

長らくお時間を頂戴しております

I used MeCab to get space-separated words:

import MeCab

def tokenize(text):
    wakati = MeCab.Tagger("-Owakati")
    wakati.parse("")  # work around a known mecab-python initialization issue
    words = wakati.parse(text)

    # strip the trailing newline that MeCab appends
    if words[-1] == u"\n":
        words = words[:-1]

    return words

texts = [tokenize(a) for a in samples]

This tokenize function returns space-separated words:

長らく お 時間 を 頂戴 し て おり ます

Divide the samples and labels

Divide the samples and labels into training data and validation data:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np
from keras.utils.np_utils import to_categorical

maxlen = 1000
training_samples = 1600 # training data 80 : validation data 20
validation_samples = len(texts) - training_samples
max_words = 15000

# create word index
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print("Found {} unique tokens.".format(len(word_index)))

data = pad_sequences(sequences, maxlen=maxlen)

# to binary class matrix
categorical_labels = to_categorical(labels)
labels = np.asarray(categorical_labels)

print("Shape of data tensor:{}".format(data.shape))
print("Shape of label tensor:{}".format(labels.shape))

# shuffle indices
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]

The data is an integer sequence like this:

[0, 0, 0, 10, 5, 24]

Each non-zero integer corresponds to a word, and zero stands for "empty word" (padding). In this example the text is only three words long, so the rest of the sequence is filled with zeros.
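To see which words a sequence encodes, you can invert the word index of the tokenizer fitted above (a minimal sketch):

# invert the word index so we can look up words by integer id
index_to_word = {i: w for w, i in tokenizer.word_index.items()}

seq = [0, 0, 0, 10, 5, 24]
print([index_to_word.get(i, '?') for i in seq if i != 0])  # drop the padding zeros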

Create a model and learn features

I used Keras to learn the features. The model includes an LSTM layer and a word embedding layer. LSTM is used for sequence classification problems, sequence regression problems, and so on.

Create a model

from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM

model = Sequential()
model.add(Embedding(max_words, 100, input_length=maxlen))  # max_words = 15000, as above
model.add(LSTM(32))
model.add(Dense(4, activation='softmax'))  # softmax suits single-label, multi-class output
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model.summary()

This model learns the word embedding with Embedding(...) at the same time as the LSTM. We can also use a pre-trained word embedding instead of learning the embedding from scratch.
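A minimal sketch of that variant, assuming embedding_matrix is a (max_words, 100) matrix built from pre-trained vectors such as fastText:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM

# placeholder; in practice, fill each row with the pre-trained vector for that word
embedding_matrix = np.zeros((max_words, 100))

model = Sequential()
model.add(Embedding(max_words, 100, input_length=maxlen,
                    weights=[embedding_matrix], trainable=False))  # freeze the embedding
model.add(LSTM(32))
model.add(Dense(4, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])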

Learn features

Just call model.fit():

history = model.fit(x_train, y_train, epochs=15, batch_size=32, validation_data=(x_val, y_val))

Plot the result

%matplotlib inline

import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

The result looks like this:

Finally, the validation accuracy reaches about 90 percent.

Save the model

Save the model and the learned weights:

model.save('pre_trained_model.h5')

Create a web application

I wanted to use the pre-trained model from a web application, so this time I used Flask, because it is written in the same language as Keras. The application is simple: it receives a text, predicts its category, and returns the result to the user. The page has a text area, an ask button, and the prediction result.
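A minimal sketch of such an endpoint (the route name and response shape are my assumptions, not the actual app in the repository):

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/ask', methods=['POST'])
def ask():
    question = request.form['question']
    # tokenize the question, turn it into a padded sequence,
    # and call model.predict() as shown in the next section
    category = classify(question)  # hypothetical helper wrapping those steps
    return jsonify({'category': category})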

Predict a certain question

Before predicting on a text, we have to convert it to a sequence using the same word index we created when building the pre-trained model.

app.py

# load the pre-trained model
model = load_model('../pre_trained_model.h5')

# padded_seq has to be passed as a 2-dimensional array
result = model.predict([padded_seq])
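The padded_seq above can be built the same way as in training (a sketch, assuming the fitted tokenizer and the tokenize function are available to the app):

from keras.preprocessing.sequence import pad_sequences

# same pipeline as training: tokenize, map words to integers, pad to maxlen
seq = tokenizer.texts_to_sequences([tokenize(question)])
padded_seq = pad_sequences(seq, maxlen=1000)[0]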

Get the classified result:

np.argmax(result[0])
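np.argmax() returns the predicted label index; mapping it back to a category name could look like this ('Other' is my placeholder name for the grouped categories):

# label numbering from the training script: 1 = other, 2 = Account, 3 = Payment
categories = {1: 'Other', 2: 'Account', 3: 'Payment'}
predicted_category = categories[np.argmax(result[0])]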

Please see the whole source code in my repository.

Reference

Deep Learning with Python. This book was helpful for me!