An efficient way to use Word2Vec in NLP
Introduction
This article introduces a word-embedding trick for deep learning tasks in NLP: w2vembeddings. It removes unnecessary memory usage, cuts load time by about 95%, and greatly improves the manageability and reusability of a word embedding library.
Problem
When performing NLP tasks, Word2Vec is generally used for the embedding layer: a word vector matrix is built for the vocabulary of the current task, as below.

import numpy as np

def embedding_matrix_p(embeddings_index, word_index, max_features):
    # take the vector size from a word that is sure to be in the pretrained index
    embed_size = embeddings_index.get('a').shape[0]
    # rows that the loop below never fills simply stay zero
    embedding_matrix = np.zeros((max_features, embed_size))
    for word, i in word_index.items():
        if i >= max_features:
            continue
        embedding_vector = embeddings_index.get(word, np.zeros(embed_size))
        embedding_matrix[i] = embedding_vector
    return embedding_matrix
EMBEDDING_FILE = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'
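The function above expects an embeddings_index dict mapping each word to its vector. As a minimal sketch (the load_embeddings_index helper below is illustrative, not part of the original code), that dict can be built from the GloVe text file like this:

import numpy as np

def load_embeddings_index(path, embed_size=300):
    # Parse a GloVe-style text file into a dict of word -> float32 vector.
    embeddings_index = {}
    with open(path, encoding='utf-8', errors='ignore') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            if len(parts) <= embed_size:  # skip headers or malformed lines
                continue
            word = ' '.join(parts[:-embed_size])  # a few GloVe tokens contain spaces
            embeddings_index[word] = np.asarray(parts[-embed_size:], dtype='float32')
    return embeddings_index

embeddings_index = load_embeddings_index(EMBEDDING_FILE)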
But this has a problem: parsing the vectors from a file (txt/bin) every time is slow, and most of the vectors loaded into memory are never used. A given task may have a vocabulary of only 20,000 words, while a pretrained word vector file typically contains several million entries, so only a few percent of what is loaded is actually useful. How can we avoid this wasted time and memory?
Solution
After looking into the usual ways of deploying word2vec, I found there are roughly two:
1. Load the vectors into memory from a local file.
2. Serve them through a REST API.
Approach 1 is the mode described above: time and memory are wasted.
Approach 2 serves the vectors as a separate service, which gives a cleaner architecture. But it is awkward when you need global information about the word vectors, and is it really worth standing up a whole service for this?
In my spare time, I took the embeddings project as a reference and did some additional work on top of it: w2vembeddings.
The main idea is to move the word vectors into SQLite; essentially, it just changes the storage format of word2vec. Since SQLite is a serverless, file-based database, it is convenient and efficient, and lookups are fast (writes are only needed once, when the file is first converted).
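As a rough illustration of the idea (the table schema and function name here are my own assumptions, not w2vembeddings' actual code), a one-off conversion from a text vector file into an SQLite store could look like this:

import sqlite3
import numpy as np

def build_sqlite_store(txt_path, db_path, embed_size=300):
    # One-off conversion: pretrained text vectors -> SQLite table (word -> vector blob).
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE IF NOT EXISTS embeddings (word TEXT PRIMARY KEY, vec BLOB)')
    rows = []
    with open(txt_path, encoding='utf-8', errors='ignore') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            if len(parts) <= embed_size:
                continue
            word = ' '.join(parts[:-embed_size])
            vec = np.asarray(parts[-embed_size:], dtype='float32').tobytes()
            rows.append((word, vec))
            if len(rows) >= 10000:  # batch inserts so memory stays flat
                conn.executemany('INSERT OR REPLACE INTO embeddings VALUES (?, ?)', rows)
                rows = []
    if rows:
        conn.executemany('INSERT OR REPLACE INTO embeddings VALUES (?, ?)', rows)
    conn.commit()
    conn.close()

After this single write pass, the database file can be reused read-only across tasks.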
Main features:
Word vector library management
Instant word vector lookup, ready to use whenever needed
Usage:
from w2vembeddings.w2vemb import EMB
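EMB's full interface is not reproduced here, but the lookup pattern underneath is the interesting part: query only the words the current task actually needs instead of loading millions of vectors. A hypothetical rewrite of the earlier matrix builder against the SQLite store sketched above:

import sqlite3
import numpy as np

def embedding_matrix_from_db(db_path, word_index, max_features, embed_size=300):
    # Build the task's embedding matrix by querying only the words it needs.
    conn = sqlite3.connect(db_path)
    embedding_matrix = np.zeros((max_features, embed_size), dtype='float32')
    for word, i in word_index.items():
        if i >= max_features:
            continue
        row = conn.execute('SELECT vec FROM embeddings WHERE word = ?', (word,)).fetchone()
        if row is not None:
            embedding_matrix[i] = np.frombuffer(row[0], dtype='float32')
    conn.close()
    return embedding_matrix

For a task with a 20,000-word vocabulary this touches 20,000 rows rather than parsing several million lines of text.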
This saves almost all of the unnecessary memory usage and about 95% of the load time.
Limitations
Of course, this solution has its own drawbacks. Like the REST API approach, it is not convenient when you need global information about the word vectors (for example, similarity queries over the whole vocabulary). If you have such a requirement, gensim is a better fit.
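For reference, a minimal gensim sketch that loads the full vector space into memory, which is what makes such global queries possible (the model path is a placeholder):

from gensim.models import KeyedVectors

# Loads every vector into memory, enabling whole-vocabulary operations.
kv = KeyedVectors.load_word2vec_format('word2vec.bin', binary=True)  # placeholder path
print(kv.most_similar('king', topn=5))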
The SQLite-based approach is mainly aimed at researchers who use word2vec frequently, and at small online deployment scenarios.
Originally published on my blog and WeChat.