An Unconventional NER

So, everyone is always excited to get something like Alexa or an Amazon Echo into their home. Sounds exciting, and it is. We sometimes find our entertainment through them, asking awkward questions just to see how they answer. The same goes for Siri. Or how many of you know about J.A.R.V.I.S., the one who always accompanies Iron Man? Exciting, right? All of them have one thing in common: they understand the intent of what we are saying (Intent Recognition: I will post about it later) and recognise the entities in that sentence.

Example: if you ask Alexa, “What is the capital of America?”, it recognises the words “capital” and “America” as entities. Likewise, if you ask an Amazon Echo, “Hey, play Girls like me by Maroon 5”, it recognises “Girls like me” as a song (or a video, or a movie) and “Maroon 5” as a name, and from the context of the sentence, as an artist.

All of this happens through Named Entity Recognition (NER).


What is NER?

Named Entity Recognition (NER) is one of the very useful information extraction techniques to identify and classify named entities in text. These entities are pre-defined categories such as a person’s name, organizations, locations, time representations, financial elements, etc.

Apart from these generic entities, there can be other domain-specific terms defined for a particular problem. These terms represent elements that have a unique context compared to the rest of the text: for example, operating systems, programming languages, football league team names, etc. Machine learning models can be trained to categorise such custom entities, which are usually denoted by proper names and therefore mostly appear as noun phrases in text documents.

NER Categories

Broadly, NER has three top-level categorisations — entity names, temporal expressions and number expressions (a toy example follows the list):

I. Entity Names represent the identity of an element, for example, name of a person, title, organisation, any living or non living thing, etc.

II. A Temporal Expression is some sequence of words with time related elements for example, calendar dates, times of day, duration, etc.

III. A Numerical Expression is a mathematical sentence involving only numbers and/or operation symbols. It could depict financial numbers, tangible entities, mathematical expressions, etc.
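
To make this concrete, here is a toy illustration (the sentence and labels are my own and not from any data set used later):

sentence = "Apple paid $2 billion to Acme Corp on 5 January 2020"

# I. entity names
entity_names = {"Apple": "organisation", "Acme Corp": "organisation"}
# II. temporal expressions
temporal_expressions = {"5 January 2020": "date"}
# III. number expressions
number_expressions = {"$2 billion": "amount"}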

Now, there are a lot of ways to do NER, including CRFs (Conditional Random Fields), sequence tagging with an LSTM, a BiLSTM with a CRF layer, a residual LSTM, or ELMo (the state of the art).

If you want to go through those first, here it is:


A Slightly Unconventional Approach

What if I told you that you could avoid all those 1024-d ELMo embeddings (computationally expensive) and instead use a simple binary classification approach? Let's get to it and see how it works:

Honestly, I have very little contextual sentence data right now, so I created a very small data set of just a few sentences. I am giving you the link, but you should really try this with a bulk data set while I build one.

The data is based on the BILOU tagging scheme.
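
If you have not seen BILOU before: each token is tagged as the Beginning, Inside or Last token of a multi-token entity, as a Unit (single-token) entity, or as Outside any entity. A toy example (not taken from my data set):

# Toy BILOU-tagged sentence, for illustration only
tagged = [
    ("book",     "O"),
    ("a",        "O"),
    ("flight",   "O"),
    ("to",       "O"),
    ("new",      "B-city"),
    ("york",     "L-city"),
    ("tomorrow", "U-date"),
]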

So, let's get started with the code.

import numpy as np
import warnings
import random
import re
import os.path
from os import path
import pickle
import math
import json

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from keras.preprocessing.text import Tokenizer  # needed for the tokenization steps below
train_data = 'text.json'
with open(train_data) as file:
  data = json.load(file)
train_data = data['sentence_data']['examples']
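
The exact schema of text.json is not shown here, but based on the fields the code above reads ('sentence_data' -> 'examples', each example holding a 'text' plus a list of 'slots' with character offsets and a 'slot' label), a minimal toy file could be created like this (the sentences and labels are illustrative, not my actual data):

# A toy text.json with the structure the loading code above expects
sample = {
  "sentence_data": {
    "examples": [
      {
        "text": "I want to fly from Germany to France",
        "slots": [
          {"start": 19, "end": 26, "slot": "country"},
          {"start": 30, "end": 36, "slot": "country"}
        ]
      },
      {
        "text": "Thank you",
        "slots": []
      }
    ]
  }
}

with open('text.json', 'w') as f:
  json.dump(sample, f, indent = 2)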

Now that the data is loaded, let's get the sentences and the entities present in them.

data_list = []
seq_present = 0
for element in train_data:
  list1 = dict()
  query = element['text'].lower()
  length = len(element['slots'])
  seq_present += length
  for e in element['slots']:
    # map the annotated text span (chunk) to its entity label
    chunk = query[e['start']:e['end']]
    list1[chunk] = e['slot']
  if len(element['slots']) == 0:
    list1[query] = ''
  data_list.append(list1)
print(data_list)
print(seq_present)

So, you can see 19 sequences (parts of sentences) with an entity attached to them, for example “germany”: “country” and “porsche”: “company”.


entities = []
for element in data_list:
  for key, value in element.items():
    if value not in entities and value != '':
      entities.append(value)
entities.sort()
print(entities)
print(len(entities))

So, we see that we have a total of 10 entities in the data. Now let's make a one-hot encoding out of these entities.

ents = np.asarray(entities)
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(ents)
onehot_encoder = OneHotEncoder(sparse = False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_entities = onehot_encoder.fit_transform(integer_encoded)
print(onehot_entities)

It will be of shape (10, 10): each one-hot row represents one entity.

Now, let's load our pre-trained word embeddings. For this small data set I have used the Stanford NLP GloVe 6B, 50-dimensional word embeddings (glove.6B.50d.txt). But I would recommend at least the GloVe 840B, 300-dimensional embeddings. An even better option would be subword/character-aware embeddings such as Facebook's fastText.

# build a word -> vector lookup table from the GloVe file
embeddings_index = dict()
f = open('glove.6B.50d.txt', encoding = 'utf-8')
for line in f:
  values = line.split()
  word = values[0]
  coefs = np.asarray(values[1:], np.float32)
  embeddings_index[word] = coefs
f.close()

Now, let’s get the queries present:

queries = []
for element in train_data:
  queries.append(element['text'].lower())
print(queries)

As you can see, it is a tiny data set, and that is the one weak point here. Data is the fuel behind all of this, and right now I am not even following that principal rule myself.

Now, let’s tokenize the sentences.

t = Tokenizer(filters='!"#%&()*+,-.:;<=>?@[\]^`{|}~ ')
t.fit_on_texts(queries)
print(t.word_index)

As you can see, “$80000” and “7/1/2014” are kept as single tokens instead of being broken up or having the “$” and “/” stripped. To get that behaviour I adjusted the filters to my use case: I removed “$” and “/” (needed for dates) from Keras's default filter list.
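
As a quick illustration of the effect (toy sentence, not from the data set):

# Keras's default filters strip '$' and '/', so "$80000" and "7/1/2014" get broken up;
# removing those two characters from the filters keeps them as single tokens.
sample = ["pay $80000 by 7/1/2014"]

default_t = Tokenizer()
default_t.fit_on_texts(sample)
print(list(default_t.word_index))   # e.g. ['pay', '80000', 'by', '7', '1', '2014']

custom_t = Tokenizer(filters='!"#%&()*+,-.:;<=>?@[\\]^`{|}~ ')
custom_t.fit_on_texts(sample)
print(list(custom_t.word_index))    # e.g. ['pay', '$80000', 'by', '7/1/2014']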

Now, let’s have a look at the unknown words:

unk_words = []
for word, i in t.word_index.items():
  embedding_vector = embeddings_index.get(word)
  if embedding_vector is None:
    unk_words.append(word)
print(unk_words)
print(len(unk_words))

We get 5 unknown words that are not present in the GloVe vocabulary.

Now, there are certain words (apr, 7/1/2014, etc.), numbers (> 1000) and amounts ($500, 525$, etc.) in the data whose vectors are not present in the GloVe data. So, we will assign random vectors to them in 3 categories:

  1. unk_vec_for_word (for unknown words)
  2. unk_vec_for_num (for unknown numbers)
  3. unk_vec_for_amt (for amounts)

You can also assign something like an unknown vector for dates or other unique stuff as per your data.

# average two randomly chosen GloVe vectors to create each "unknown" vector
vectors_list = []
for key, value in embeddings_index.items():
  vectors_list.append(value)
vec1 = random.choice(vectors_list)
vec2 = random.choice(vectors_list)
unk_vec_for_word = (vec1 + vec2) / 2
vec1 = random.choice(vectors_list)
vec2 = random.choice(vectors_list)
unk_vec_for_num = (vec1 + vec2) / 2
vec1 = random.choice(vectors_list)
vec2 = random.choice(vectors_list)
unk_vec_for_amt = (vec1 + vec2) / 2
print(unk_vec_for_word)
print(unk_vec_for_num)
print(unk_vec_for_amt)

I got the necessary unknown vectors.

Now, let’s get all the sequences we can get for every sentence:

final_data = []
for i in range(len(queries)):
  data_dict = dict()
  query = []
  query.append(queries[i])
  tw = Tokenizer(filters='!"#%&()*+,-.:;<=>?@[\]^`{|}~ ')
  tw.fit_on_texts(query)
  wordlist = []
  for word, j in tw.word_index.items():
    wordlist.append(word)
  ents = data_list[i]
  data_dict['query'] = queries[i]
  sequences = []
  # for each starting word, build the growing sequences "wi", "wi wi+1", ...,
  # all the way to the end of the sentence
  for word in wordlist:
    index = wordlist.index(word)
    seq = ''
    for w in wordlist[index:]:
      seq = seq + w
      sequences.append(seq)
      seq = seq + ' '
  data_dict['sequences'] = sequences
  final_data.append(data_dict)
print(final_data)
number_of_seq = 0
for element in final_data:
  number_of_seq = number_of_seq + len(element['sequences'])
print(number_of_seq)

The number of sequences for a sentence is the sum over its words: a sentence of n words gives n + (n-1) + ... + 1 = n(n+1)/2 sequences.

If a sentence has 4 words, we get (4 + 3 + 2 + 1) = 10 sequences.

Similarly, a sentence of 5 words gives (5 + 4 + 3 + 2 + 1) = 15 sequences. After executing the code above I get 164 sequences in total, across 8 sentences.
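
A quick sanity check of that count:

def expected_sequences(n_words):
  # n + (n-1) + ... + 1 forward-contiguous sequences for an n-word sentence
  return n_words * (n_words + 1) // 2

print(expected_sequences(4))   # 10
print(expected_sequences(5))   # 15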

Now for the main part: building our training features.

To calculate the vector for a sequence, we take the mean of two vectors: the mean embedding of the chunk that comes before the sequence in the sentence, and the mean embedding of the sequence itself. If there is no chunk before the sequence, we use the chunk after it instead. If the sentence contains only one word, or the sequence is the whole sentence so there is neither a preceding nor a following chunk, we take the mean of the sequence's mean vector and a NumPy zero vector.

Each of these mean vectors is then concatenated with each of the one-hot encoded entity vectors. The helper that computes the mean vector (get_X in the code below) was originally shared as screenshots.
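
Since those screenshots are not reproduced here, below is a minimal sketch of what a get_X-style helper could look like, based purely on the description above; treat it as an approximation rather than my exact function. The i argument is kept only to match the call sites and is unused in the sketch, and unknown words simply fall back to unk_vec_for_word (numbers and amounts could be routed to unk_vec_for_num and unk_vec_for_amt in the same way).

# A rough sketch of get_X, reconstructed from the description above
def get_X(i, sequence, chunk, dim = 50):
  def mean_vector(text):
    # mean of the word vectors in a chunk or sequence; words missing from
    # GloVe fall back to the random unknown vector defined earlier
    vectors = []
    for word in text.split():
      vec = embeddings_index.get(word)
      if vec is None:
        vec = unk_vec_for_word
      vectors.append(vec)
    return np.mean(np.asarray(vectors), axis = 0)

  seq_mean = mean_vector(sequence)
  if chunk is None:
    # no pre- or post-chunk: average the sequence mean with a zero vector
    return (seq_mean + np.zeros(dim, dtype = np.float32)) / 2
  return (seq_mean + mean_vector(chunk)) / 2

With that in place, here is the main loop that builds the training matrix: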

X_train = np.zeros((number_of_seq*len(entities), 50 + len(entities)))
Y_train = []
i = 0
for j in range(len(queries)):
  query = []
  query.append(queries[j])
  tw = Tokenizer(filters='!"#%&()*+,-.:;<=>?@[\]^`{|}~ ')
  tw.fit_on_texts(query)
  wordlist = []
  for word, k in tw.word_index.items():
    wordlist.append(word)
  seq = final_data[j]
  for sequence in seq['sequences']:
    wl = sequence.split(' ')
    # pre-chunk: everything in the sentence before the sequence
    index = wordlist.index(wl[0])
    chunk = ''
    for word in wordlist[:index]:
      chunk = chunk + word
      chunk = chunk + ' '
    if chunk == '':
      # no pre-chunk, so use the post-chunk: everything after the sequence
      index = wordlist.index(wl[len(wl)-1])
      chunk = ''
      for word in wordlist[index+1:]:
        chunk = chunk + word
        chunk = chunk + ' '
    # look up the gold entity (slot) for this sequence, if any
    slots = data_list[j]
    slot = ''
    for key in slots:
      if sequence == key:
        slot = slots[sequence]
        break
    if chunk == '':
      chunk = None
    mean = get_X(i, sequence, chunk)
    mean = mean.reshape(50, 1)
    # pair the sequence vector with every entity's one-hot encoding;
    # the pair matching the gold slot is labelled 1, all others 0
    rows = onehot_entities.shape[0]
    for x in range(rows):
      oh = onehot_entities[x].reshape(len(entities), 1)
      conc = np.concatenate((mean, oh))
      conc = conc.reshape(1, 50 + len(entities))
      X_train[i] = conc
      inverted = label_encoder.inverse_transform([np.argmax(onehot_entities[x, :])])
      warnings.simplefilter('ignore')
      inverted = inverted.tolist()
      if inverted[0] == slot:
        Y_train.append(1.)
      else:
        Y_train.append(0.)
      i += 1

Each sequence vector paired with the correct one-hot entity is labelled 1; all other pairs are labelled 0.

Y_train = np.asarray(Y_train)
print(X_train.shape)
print(Y_train.shape)

So, I have 1640 data points with 60 dimensions each (50 from the word vector, 10 from the number of entities).

Let's see what they look like:

print(X_train)
print(Y_train)

Now, since we will be using a binary classification set-up on top of a Bidirectional LSTM + Dense model, we need to reshape our features and labels from 2-D to 3-D (the LSTM expects input of shape (samples, timesteps, features)).

from sklearn.model_selection import train_test_split
X_tr, X_te, Y_tr, Y_te = train_test_split(X_train, Y_train, test_size = 0.1, shuffle = True, random_state = 1)
X_tr = X_tr.reshape(X_tr.shape[0], 1, 50+len(entities))
Y_tr = Y_tr.reshape(Y_tr.shape[0], 1, 1)
X_te = X_te.reshape(X_te.shape[0], 1, 50+len(entities))
Y_te = Y_te.reshape(Y_te.shape[0], 1, 1)

Now, the training part:

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import TimeDistributed
from keras.layers import Bidirectional
from keras.layers import Dropout
from keras.callbacks import EarlyStopping

model = Sequential()
# one timestep per sample, 60 features (50-d mean vector + 10-d one-hot entity)
model.add(Bidirectional(LSTM(8, return_sequences = True), input_shape = (1, 50 + len(entities))))
model.add(TimeDistributed(Dense(1, activation = 'sigmoid')))
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['acc'])
model.summary()
es = EarlyStopping(monitor = 'loss', patience = 1, verbose = 1, mode = 'auto')
model.fit(X_tr, Y_tr, batch_size = 4, epochs = 10, verbose = 1, callbacks = [es])

When you have a good enough data set, you should monitor 'val_loss' in the EarlyStopping callback instead, and pass something like validation_split = 0.3 to model.fit() so that you get a proper validation signal.
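
A sketch of what that would look like (the batch size is just an illustrative choice):

# With a bigger data set: validate on a held-out split and stop on val_loss
es = EarlyStopping(monitor = 'val_loss', patience = 1, verbose = 1, mode = 'auto')
model.fit(X_tr, Y_tr, batch_size = 32, epochs = 10, verbose = 1,
          validation_split = 0.3, callbacks = [es])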

Let’s check the scores:

scores = model.evaluate(X_tr, Y_tr, verbose = 1)
print("Train accuracy %s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))

scores = model.evaluate(X_te, Y_te, verbose = 1)
print("Test accuracy %s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))

Looks good enough, but not something I can really trust yet since the data set is so small. I am working on a bigger one now.

y_pred = model.predict_proba(X_te)
Y_te = Y_te.reshape(Y_te.shape[0], 1)
y_pred = y_pred.reshape(y_pred.shape[0], 1)
predicted = y_pred[:,0]
print(predicted)

Now, let's check our ROC curve and AUC score (read up on what they represent before proceeding):

from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(Y_te, predicted)
print(fpr, tpr, thresholds)
roc_auc = auc(fpr, tpr)
print(roc_auc)

Okay, our ROC AUC score is really low compared to our accuracy, but that is because we have so few data points. Anyone is welcome to try a better data set and share the score; meanwhile, I am on it too.

import matplotlib.pyplot as plt
plt.figure()
plt.plot(fpr, tpr, color = 'darkorange',label = 'ROC Curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color = 'navy', linestyle = '--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title('Receiver Operating Characteristic')
plt.legend(loc = "lower right")
plt.show()

Not the curve I wanted to see but then again, less data.

plt.bar(fpr, tpr)

So, let’s test our model out with a sentence and see how it works:

test_query = "India is a great country"
test_queries = [test_query]
test_data = []
for i in range(len(test_queries)):
  data_dict = dict()
  query = []
  query.append(test_queries[i])
  tw = Tokenizer(filters='!"#%&()*+,-.:;<=>?@[\]^`{|}~ ')
  tw.fit_on_texts(query)
  wordlist = []
  for word, j in tw.word_index.items():
    wordlist.append(word)
  ents = data_list[i]  # leftover from the training loop; not used for this test sentence
  data_dict['query'] = test_queries[i]
  sequences = []
  for word in wordlist:
    index = wordlist.index(word)
    seq = ''
    for w in wordlist[index:]:
      seq = seq + w
      sequences.append(seq)
      seq = seq + ' '
  data_dict['sequences'] = sequences
  test_data.append(data_dict)
number_of_seq = 0
for element in test_data:
  number_of_seq = number_of_seq + len(element['sequences'])
print(number_of_seq)

We get 15 sequences in this sentence.

X_example = np.zeros((number_of_seq*len(entities), 50 + len(entities)))
i = 0
for j in range(len(test_queries)):
  query = []
  query.append(test_queries[j])
  tw = Tokenizer(filters='!"#%&()*+,-.:;<=>?@[\]^`{|}~ ')
  tw.fit_on_texts(query)
  wordlist = []
  for word, k in tw.word_index.items():
    wordlist.append(word)
  seq = test_data[j]
  for sequence in seq['sequences']:
    wl = sequence.split(' ')
    index = wordlist.index(wl[0])
    chunk = ''
    for word in wordlist[:index]:
      chunk = chunk + word
      chunk = chunk + ' '
    if chunk == '':
      index = wordlist.index(wl[len(wl)-1])
      chunk = ''
      for word in wordlist[index+1:]:
        chunk = chunk + word
        chunk = chunk + ' '
    if chunk == '':
      chunk = None
    # slot lookup carried over from the training loop; data_list holds the
    # training slots, so nothing matches the test sequences and slot stays None
    slots = data_list[j]
    slot = ''
    for key in slots:
      if sequence == key:
        slot = slots[sequence]
        break
    if slot == '':
      slot = None
    mean = get_X(i, sequence, chunk)
    
    mean = mean.reshape(50, 1)
    
    rows = onehot_entities.shape[0]
    for x in range(rows):
      oh = onehot_entities[x].reshape(len(entities), 1)
      conc = np.concatenate((mean, oh))
      conc = conc.reshape(1, 50 + len(entities))
      X_example[i] = conc
      i += 1
X_example = X_example.reshape(X_example.shape[0], 1, 50 + len(entities))
result = model.predict(X_example)
result = result.reshape(result.shape[0], 1)
result = result.tolist()

We get the result as one flat list, but to get the result for each sequence in the sentence we need to divide the list into chunks of 10 (the number of entities) each.

def divide_chunks(l, n):
  # yield successive n-sized chunks from list l
  for i in range(0, len(l), n):
    yield l[i:i + n]

# flatten the model output from [[p], [p], ...] to [p, p, ...]
for index, element in enumerate(result):
  result[index] = element[0]
print(result)
result_list = list(divide_chunks(result, len(entities)))
print(result_list)

Each inner list now holds the 10 entity probabilities for one sequence. Let's rank the entities for every sequence:

sequences = test_data[0]['sequences']
for i in range(len(sequences)):
  prob_list = result_list[i]
  # pair every entity with its probability and sort by probability, descending
  ranked = sorted(zip(entities, prob_list), key = lambda pair: pair[1], reverse = True)
  rank_dict = dict(ranked)
  print("Sequence: ", sequences[i])
  print("Rank: ", rank_dict)
  print('\n')

So, every sequence gets a ranked entity list (in descending order of probability). The entity with the highest probability (the first one in each rank dictionary) is the predicted entity for that sequence.
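
If you only want the single best entity per sequence, here is a small helper along those lines (the 0.5 cut-off is just an arbitrary choice, not something I tuned):

# Keep only the top-ranked entity per sequence; treat low-probability
# sequences as carrying no entity at all (0.5 is an arbitrary threshold)
for i in range(len(sequences)):
  prob_list = result_list[i]
  best_index = int(np.argmax(prob_list))
  if prob_list[best_index] >= 0.5:
    print(sequences[i], "->", entities[best_index], prob_list[best_index])
  else:
    print(sequences[i], "-> no entity")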

The full ranked output is quite long, so let me just show you the result for the word “India”, since that is the only word in the sentence that actually carries an entity.

Yoooo!! We got it right, and “India” was not even in our data set.

We would have liked higher confidence scores, though; that should improve once there is more data.

I will give the link to the iPython Notebook so that it is easier for all of you.

 

This was my first post about anything related to machine learning. Any points you raise in the comments will be highly appreciated.

Also, anyone who tries this with a bigger data set and shares the results is always welcome.

Thanks… Ciao!

 
