Note: this article is intended for readers already familiar with word embeddings and their use.
Take the 10,000 most common English words and use them as a training set. Use linear regression to learn a mapping from one embedding space to the other. Plugging in this learned conversion works well enough even if the training/validation loss isn’t that low.
Structure of the article
- The problem:
I describe why it was necessary to have both a well-trained general word embedding like GloVe and word vectors trained on domain-specific documents using FastText.
- Why not fine-tuning:
Why I couldn’t just fine-tune the GloVe embeddings on the domain-specific documents.
- Mapping word embeddings from FastText to GloVe:
FastText performed better for domain-specific word embeddings, but my model had been pre-trained using GloVe embeddings. What do I do? Use frequent words as pivots to transform one space into the other.
When faced with a lack of data, a machine learning enthusiast would take one of three possible approaches:
- Manually gather/create the required data
- Transfer learning
- Data augmentation
One uses word embeddings in transfer learning to obtain vector representations of words from a text corpus which might not even be relevant to the problem at hand. They are tremendously helpful. However, a resource-deprived engineer like me faces some problems:
- Technical terms and jargon used in your domain might be rare or even absent in word embeddings trained on Wikipedia. Even if they are present, they might not carry the meaning you want them to have. For instance, TILA is a well-known federal act in the US mortgage industry. Its closest vector in the GloVe embeddings is ‘tequila’. Yeah, you don’t want that.
- Publicly available pre-trained models like GloVe and FastText are not easy on a laptop with 4 GB of RAM. My laptop can go as far as loading the 6-billion-token GloVe data with 100-dimensional word vectors. FastText doesn’t provide vectors smaller than 300 dimensions, and the same is true of the Google News vectors trained with word2vec.
For the first point, one solution is to train your own word vectors on the documents relevant to the problem at hand. But if you choose this, your vocabulary would be inadequate for certain tasks like information retrieval, where you first want to train on a general dataset like SQuAD, which contains words from a really wide set of domains. It then wouldn’t be possible to train your paragraph retrieval model on datasets like SQuAD and get good accuracy.
So you pre-train the model on a general dataset using GloVe word embeddings. The next step is to train it on the domain-specific problem. However, the problem described in the first point appears: technical words and jargon are not represented in the GloVe dataset. You have to have correct word embeddings for those words.
That’s why you need the best of both worlds: the huge vocabulary of GloVe, and the domain-specific vocabulary learned from your own documents. How to merge the two? Possible options: fine-tune the GloVe embeddings on the domain documents, or train a new model to learn the word embeddings of the missing words and figure out how to plug them into the model which has learned using GloVe.
Why not fine-tuning?
- It’s not easy. Look at the issues posted on the GloVe repository and over here. Gensim also doesn’t support training GloVe embeddings, even if you convert them into word2vec format.
- You are not guaranteed to learn accurate embeddings for the existing words that were wrong. Some information from the past documents, where the irrelevant embedding was learned, will be retained.
- GloVe is old now. There are better ways to obtain word vector representations, particularly those like FastText which use character-level information. On our documents from the mortgage domain, we found FastText to perform better than word2vec.
We used FastText to learn vectors for domain-specific words and terminology by running the algorithm from the Gensim library on some 360 MB of text data related to mortgages. We then needed to figure out a way to plug these embeddings into a passage retrieval model trained on GloVe embeddings.
Mapping word embeddings from FastText to GloVe
FastText and GloVe are two different algorithms, learning vectors from different features of the input text data. They assign numerical values to different dimensions using different criteria. But would it be so crazy to think one could transform the word vectors from one space to the other, preserving the semantic knowledge during the transformation?
Consider the following intuition, which is the motivation behind this work: all the nouns would presumably be clustered in both spaces, so a simple rotation (and scalar multiplication) of the FastText vector space would put them around the same region where the nouns sit in the GloVe vector space. However, rotation alone wouldn’t preserve all the meaningful relationships between words, which take the form of distances and angles; a correct mapping may also require translation, reflection, and some stretching/squishing. There might be some non-linearity involved too, but one doesn’t expect anything too complex.
So if we are careful, we can learn a meaningful mapping between the two spaces which generalises well to unseen word vectors.
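A quick numerical check of the rotation intuition: a pure rotation (an orthogonal transform) leaves every pairwise cosine similarity, and hence the relative geometry of the words, untouched. The vectors below are random stand-ins, not real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three toy "FastText" word vectors, 100-d like the embeddings in the article.
X = rng.normal(size=(3, 100))

# A random orthogonal matrix (from a QR decomposition) plays the rotation.
Q, _ = np.linalg.qr(rng.normal(size=(100, 100)))
Y = X @ Q  # rotated copies of the same vectors

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Pairwise cosine similarities are unchanged by the rotation.
print(np.allclose(cos(X[0], X[1]), cos(Y[0], Y[1])))  # True
```

Translation, stretching, and non-linearities do change these similarities, which is exactly why the learned mapping may need more than a rotation.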
The aim of the mapping is to transfer word vectors newly learned in a different vector space into a host space in which we operate. Here we have learned word vectors using the FastText algorithm, and we want to map them onto the vector space of GloVe.
We note that frequent words like prepositions, articles, and some verbs and nouns are present in the domain-specific documents on which FastText was trained. Since these common English words are used in the same contexts and styles, carrying the same meaning throughout the English language regardless of the technical domain, we use them as pivots to transform the word vectors.
We take the 10,000 most frequent words from the NLTK Brown corpus as pivots. We split these words into training and validation sets, and train our models to map the word vectors from the FastText space to the corresponding vectors in the GloVe space, minimising the mean squared error between the predicted vectors and the original GloVe vectors.
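The pivot-selection step might look like the following sketch. The toy token list and the two vocabulary sets are hypothetical stand-ins for the Brown corpus (`nltk.corpus.brown.words()`) and the actual FastText/GloVe vocabularies; a usable pivot must exist in both:

```python
from collections import Counter

# Toy token stream standing in for the NLTK Brown corpus.
words = ["the", "of", "and", "loan", "the", "a", "the", "of", "and", "a"] * 3

# Hypothetical vocabularies of the two embedding models.
fasttext_vocab = {"the", "of", "and", "a"}
glove_vocab = {"the", "of", "and", "loan"}

# Most frequent words that are present in BOTH spaces become pivots.
freq = Counter(w.lower() for w in words)
pivots = [w for w, _ in freq.most_common(10000)
          if w in fasttext_vocab and w in glove_vocab]

# 90/10 train/validation split of the pivot words (an assumed ratio).
split = int(0.9 * len(pivots))
train_words, val_words = pivots[:split], pivots[split:]
print(pivots)
```

Filtering to the intersection of both vocabularies matters: a pivot with a vector in only one space contributes nothing to the regression.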
Various models were tried to reduce the mean-squared-error loss for the mapping. Both the input and output vectors are 100-dimensional. A simple linear regression without bias (intercept term) works as well as one with bias, and even as well as one with a tanh non-linearity included. However, adding one hidden layer of 1000 neurons with tanh non-linearity reduces the validation loss, albeit not drastically. This means that translation and non-linearity weren’t really needed: the two algorithms learn quite similar embeddings which are just rotated/stretched/squished/reflected versions of each other.
- Linear regression without translation: train_loss: 0.1842, val_loss: 0.2103
- One hidden layer with tanh non-linearity: train_loss: 0.1392, val_loss: 0.1962
Despite the not-so-small mean squared error, we find that the words in the test set (the ones we ultimately wanted meaningfully represented in the GloVe space after being learned with the FastText algorithm) do get transformed quite meaningfully. The word ‘TILA’ no longer matches the word ‘tequila’ among the new GloVe vectors, but now resembles ‘rules’ and ‘laws’ as intended.
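A sanity check of this kind is just a cosine nearest-neighbour lookup over the GloVe vocabulary. The three-dimensional vectors below are made-up toy values chosen to illustrate the ‘TILA’ example, not the real 100-d embeddings:

```python
import numpy as np

# Hypothetical mini GloVe vocabulary (real vectors are 100-d).
glove = {
    "tequila": np.array([0.9, 0.1, 0.0]),
    "rules":   np.array([0.1, 0.9, 0.2]),
    "laws":    np.array([0.2, 0.8, 0.1]),
}

# Toy vector for 'TILA' after the FastText-to-GloVe mapping.
tila_mapped = np.array([0.15, 0.85, 0.15])

def nearest(query, vocab):
    # Return the vocabulary word with the highest cosine similarity.
    sims = {w: query @ v / (np.linalg.norm(query) * np.linalg.norm(v))
            for w, v in vocab.items()}
    return max(sims, key=sims.get)

print(nearest(tila_mapped, glove))
```

With real embeddings, running this lookup over the full GloVe vocabulary is how one verifies that mapped jargon lands near semantically sensible neighbours.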
As for the model pre-trained using the original GloVe embeddings, it performs better on the information-retrieval task on mortgage documents with the newly mapped embeddings. Specifically, the top-5 returned paragraphs for a query contained the correct paragraph 61% of the time, as opposed to 48% for the same task with the original GloVe embeddings.