An efficient way to use Word2Vec in NLP
Introduction
This article introduces a word-embedding trick for deep learning tasks in NLP: w2vembeddings. It removes unnecessary memory usage, cuts load time by about 95%, and greatly improves the manageability and reusability of a word embedding library.
Problem
When performing NLP tasks, Word2Vec is generally used for the embedding layer: a word vector matrix is built for the vocabulary of the current task, as below.

import numpy as np

def embedding_matrix_p(embeddings_index, word_index, max_features):
    # take the vector size from a word that is sure to be in the pretrained index
    embed_size = embeddings_index.get('a').shape[0]
    # rows that the loop below never fills simply stay zero
    embedding_matrix = np.zeros((max_features, embed_size))
    for word, i in word_index.items():
        if i >= max_features:
            continue
        embedding_vector = embeddings_index.get(word, np.zeros(embed_size))
        embedding_matrix[i] = embedding_vector
    return embedding_matrix
EMBEDDING_FILE = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'
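The function above expects an embeddings_index dict mapping each word to its vector. As a minimal sketch (the load_embeddings_index helper below is illustrative, not part of the original code), that dict can be built from the GloVe text file like this:

import numpy as np

def load_embeddings_index(path, embed_size=300):
    # Parse a GloVe-style text file into a dict of word -> float32 vector.
    embeddings_index = {}
    with open(path, encoding='utf-8', errors='ignore') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            if len(parts) <= embed_size:  # skip headers or malformed lines
                continue
            word = ' '.join(parts[:-embed_size])  # a few GloVe tokens contain spaces
            embeddings_index[word] = np.asarray(parts[-embed_size:], dtype='float32')
    return embeddings_index

embeddings_index = load_embeddings_index(EMBEDDING_FILE)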
But this has a problem: parsing the vectors from a file (txt/bin) every time is slow, and most of the vectors loaded into memory are never used. A given task may have a vocabulary of only 20,000 words, while a pretrained word vector file typically contains several million entries, so only a few percent of what is loaded is actually useful. How can we avoid this wasted time and memory?
Solution
After looking into the usual ways of deploying word2vec, I found there are roughly two:
1. Load the vectors into memory from a local file.
2. Serve them through a REST API.
Approach 1 is the mode described above: time and memory are wasted.
Approach 2 serves the vectors as a separate service, which gives a cleaner architecture. But it is awkward when you need global information about the word vectors, and is it really worth standing up a whole service for this?
In my spare time, I took the embeddings project as a reference and did some additional work on top of it: w2vembeddings.
The main idea is to move the word vectors into SQLite; essentially, it just changes the storage format of word2vec. Since SQLite is a serverless, file-based database, it is convenient and efficient, and lookups are fast (writes are only needed once, when the file is first converted).
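As a rough illustration of the idea (the table schema and function name here are my own assumptions, not w2vembeddings' actual code), a one-off conversion from a text vector file into an SQLite store could look like this:

import sqlite3
import numpy as np

def build_sqlite_store(txt_path, db_path, embed_size=300):
    # One-off conversion: pretrained text vectors -> SQLite table (word -> vector blob).
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE IF NOT EXISTS embeddings (word TEXT PRIMARY KEY, vec BLOB)')
    rows = []
    with open(txt_path, encoding='utf-8', errors='ignore') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            if len(parts) <= embed_size:
                continue
            word = ' '.join(parts[:-embed_size])
            vec = np.asarray(parts[-embed_size:], dtype='float32').tobytes()
            rows.append((word, vec))
            if len(rows) >= 10000:  # batch inserts so memory stays flat
                conn.executemany('INSERT OR REPLACE INTO embeddings VALUES (?, ?)', rows)
                rows = []
    if rows:
        conn.executemany('INSERT OR REPLACE INTO embeddings VALUES (?, ?)', rows)
    conn.commit()
    conn.close()

After this single write pass, the database file can be reused read-only across tasks.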
Main features:
Word vector library management
Instant word vector lookup, ready to use whenever needed
Usage:
from w2vembeddings.w2vemb import EMB
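EMB's full interface is not reproduced here, but the lookup pattern underneath is the interesting part: query only the words the current task actually needs instead of loading millions of vectors. A hypothetical rewrite of the earlier matrix builder against the SQLite store sketched above:

import sqlite3
import numpy as np

def embedding_matrix_from_db(db_path, word_index, max_features, embed_size=300):
    # Build the task's embedding matrix by querying only the words it needs.
    conn = sqlite3.connect(db_path)
    embedding_matrix = np.zeros((max_features, embed_size), dtype='float32')
    for word, i in word_index.items():
        if i >= max_features:
            continue
        row = conn.execute('SELECT vec FROM embeddings WHERE word = ?', (word,)).fetchone()
        if row is not None:
            embedding_matrix[i] = np.frombuffer(row[0], dtype='float32')
    conn.close()
    return embedding_matrix

For a task with a 20,000-word vocabulary this touches 20,000 rows rather than parsing several million lines of text.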
This saves almost all of the unnecessary memory usage and about 95% of the load time.
Limitations
Of course, this solution has its own drawbacks. Like the REST API approach, it is not convenient when you need global information about the word vectors (for example, similarity queries over the whole vocabulary). If you have such a requirement, gensim is a better fit.
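For reference, a minimal gensim sketch that loads the full vector space into memory, which is what makes such global queries possible (the model path is a placeholder):

from gensim.models import KeyedVectors

# Loads every vector into memory, enabling whole-vocabulary operations.
kv = KeyedVectors.load_word2vec_format('word2vec.bin', binary=True)  # placeholder path
print(kv.most_similar('king', topn=5))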
The SQLite-based approach is mainly aimed at researchers who use word2vec frequently, and at small online deployment scenarios.
Originally published on my blog and WeChat.