
Introduction

This article introduces a small word-embedding trick for deep learning tasks in NLP: w2vembeddings. It removes unnecessary memory usage, cuts load time by roughly 95%, and greatly improves the manageability and reusability of word-embedding libraries.

Problem

When working on NLP tasks, Word2Vec-style embeddings are commonly used: a word-vector matrix is built for the current task, as shown below.

import numpy as np

def embedding_matrix_p(embeddings_index, word_index, max_features):
    # Infer the embedding dimension from an arbitrary known word.
    embed_size = embeddings_index.get('a').shape[0]
    # With mean 0 and std 0 this is effectively a zero-initialized matrix.
    embedding_matrix = np.random.normal(0, 0, (max_features, embed_size))
    for word, i in word_index.items():
        if i >= max_features:
            continue
        # Fall back to a zero vector for out-of-vocabulary words.
        embedding_vector = embeddings_index.get(word, np.zeros(embed_size))
        embedding_matrix[i] = embedding_vector
    return embedding_matrix

EMBEDDING_FILE = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'
def get_coefs(word, *arr): return word, np.asarray(arr, dtype='float32')
embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE))
embedding_matrix = embedding_matrix_p(embeddings_index, word_index, max_features)

But this approach has a problem: the vector file (txt/bin) has to be parsed every time, which is slow, and not all of the vectors loaded into memory are actually used. A given task may contain only about 20,000 distinct tokens after segmentation, while pretrained vector files typically contain several million words, so only a tiny fraction of what is loaded ever matters. How can this unnecessary time and memory cost be avoided?
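
To get a feel for the scale, here is a rough back-of-the-envelope calculation (the 3-million-word vocabulary and 300-dimensional float32 vectors are illustrative assumptions, not measured numbers):

import numpy as np

# Hypothetical sizes: a full pretrained vocabulary vs. the words one task actually uses.
full_vocab, task_vocab, dim = 3_000_000, 20_000, 300
bytes_per_vector = dim * np.dtype('float32').itemsize

print(f"full embedding table: ~{full_vocab * bytes_per_vector / 1024 ** 2:.0f} MB")   # ~3400 MB
print(f"vectors actually used: ~{task_vocab * bytes_per_vector / 1024 ** 2:.0f} MB")  # ~23 MB, under 1% of the total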

Solution

Looking at how people usually deploy word2vec, there seem to be roughly two approaches:
1. Load the vectors into memory from a local file.
2. Start a separate service and query it through a REST API (a minimal sketch follows below).
Approach 1 is the pattern described above: every load costs time and occupies runtime memory.
Approach 2 serves vector lookups from a dedicated service, which gives a cleaner architecture. But using global word-vector information becomes awkward, and is it really worth the complexity of running a separate service?
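
As a minimal sketch of what approach 2 might look like (Flask, the route name, and the in-memory index are assumptions made for illustration, not part of the original article):

from flask import Flask, jsonify

app = Flask(__name__)
embeddings_index = {}  # filled once at startup, e.g. from the GloVe file loaded above

@app.route('/vector/<word>')
def vector(word):
    # Return the vector for a single word, or null if it is out of vocabulary.
    vec = embeddings_index.get(word)
    return jsonify({'word': word, 'vector': None if vec is None else vec.tolist()})

# Run with `flask run`; every client then fetches vectors over HTTP instead of loading the file.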

In my spare time, referring to the implementation of embeddings, I repackaged it (with some additional work) as w2vembeddings.
The core idea is to migrate the word-vector content into SQLite; essentially, word2vec is just stored in a different format. Since SQLite is a serverless database, this is convenient and efficient, and lookups are fast (writes are only needed the first time the format is converted).
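
As a rough illustration of the idea (this is not the actual w2vembeddings implementation; the table schema, file paths, and helper names are made up for the sketch), the one-time conversion and later lookups might look like this:

import sqlite3
import numpy as np

def convert_to_sqlite(txt_path, db_path):
    # One-time conversion: store each word and its raw vector bytes in SQLite.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS vectors (word TEXT PRIMARY KEY, vec BLOB)")
    with open(txt_path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word, vec = parts[0], np.asarray(parts[1:], dtype='float32')
            conn.execute("INSERT OR REPLACE INTO vectors VALUES (?, ?)", (word, vec.tobytes()))
    conn.commit()
    conn.close()

def get_vector(db_path, word):
    # Later lookups read only the rows that are actually requested.
    conn = sqlite3.connect(db_path)
    row = conn.execute("SELECT vec FROM vectors WHERE word = ?", (word,)).fetchone()
    conn.close()
    return None if row is None else np.frombuffer(row[0], dtype='float32')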

Main features:
  • Word-vector library management
  • On-demand, instant word-vector lookup
It mainly solves the memory-usage and load-time problems, and also lets you manage multiple word-vector libraries in one place.

Usage:

import numpy as np
from w2vembeddings.w2vemb import EMB

# Load the word-vector library; this replaces the original load-from-file step.
emb = EMB(name='tencent', dimensions=200)

# Extract the word-vector matrix on demand.
def embedding_matrix_p(emb, word_index, max_features):
    embed_size = len(emb.get_vector('a'))
    embedding_matrix = np.random.normal(0, 0, (max_features, embed_size))
    for word, i in word_index.items():
        if i >= max_features:
            continue
        embedding_matrix[i] = np.array(emb.get_vector(word))
    return embedding_matrix

Advantages

It eliminates almost all of the unnecessary memory usage and about 95% of the loading time.

Problem

Of course, this approach has drawbacks too. As with the REST API, it is inconvenient when you need global information about the word vectors (nearest neighbors, similarity queries, and so on). If you have that requirement, gensim is a better fit.
This method is mainly aimed at people who frequently use word2vec for research, and at small-scale online deployment scenarios.
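
For those global operations, a minimal gensim-based sketch might look like this (the file path is hypothetical; gensim's KeyedVectors loads the whole table into memory, which is exactly the trade-off discussed above):

from gensim.models import KeyedVectors

# Load a word2vec-format text file (binary=False for plain text).
kv = KeyedVectors.load_word2vec_format('vectors.txt', binary=False)

print(kv.most_similar('king', topn=5))  # nearest neighbors over the whole vocabulary
print(kv.similarity('king', 'queen'))   # cosine similarity between two words
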
Originally published on WeChat.

Python

Publishing a PyPI package

https://pypi.org/project/twine/

python setup.py sdist bdist_wheel                              # build source and wheel distributions
twine upload dist/* -u {username} -p {pwd} {--skip-existing}   # upload to PyPI with twine

Git

Adding repository access permissions

  • Generate a local SSH key (if the local key is regenerated, the key configured on the remote side must be updated as well).
  • Add the local public key to the remote repository to enable password-less access (pull/push, etc.):
    pbcopy < ~/.ssh/id_rsa.pub

Development Environment

Quickly spin up a virtual environment with Vagrant (boxes come from Vagrant Cloud):
vagrant init hashicorp/bionic64   # the box name should be one published on Vagrant Cloud
vagrant up
vagrant ssh
vagrant destroy

ML

Feature engineering resources

category_encoders
ce.CountEncoder
ce.TargetEncoder
ce.CatBoostEncoder
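
A minimal sketch of how these encoders are typically used (the toy DataFrame and column name are made up for illustration):

import pandas as pd
import category_encoders as ce

# Toy data: one categorical feature and a binary target.
X = pd.DataFrame({'city': ['a', 'b', 'a', 'c', 'b', 'a']})
y = pd.Series([1, 0, 1, 0, 1, 1])

count_enc = ce.CountEncoder(cols=['city'])     # replace each category with its frequency
target_enc = ce.TargetEncoder(cols=['city'])   # replace each category with a smoothed target mean
cat_enc = ce.CatBoostEncoder(cols=['city'])    # ordered target statistics, CatBoost-style

print(count_enc.fit_transform(X, y))
print(target_enc.fit_transform(X, y))
print(cat_enc.fit_transform(X, y))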

Feature selection resources

sklearn.feature_selection
Score-based selection (a scoring function rates each feature's contribution to the target: χ2, ANOVA F-value, mutual information)
Model-based selection (L1/L2 regularization), as sketched below
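
A minimal sketch of both styles on a toy dataset (the dataset choice and hyperparameters are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Score-based: keep the k features with the highest chi2 score against the target.
X_chi2 = SelectKBest(chi2, k=2).fit_transform(X, y)

# Model-based: keep features with non-zero weights in an L1-regularized model.
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.5)
X_l1 = SelectFromModel(l1_model).fit_transform(X, y)

print(X_chi2.shape, X_l1.shape)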

Programming Language Basics

### NodeJS
    - NodeJS basics.
    - Event Loop.
    - Loop Tick.
    - NodeJS core modules.
    - Node package manager.
    - Asynchronous JavaScript and Promises.
    - How to develop a full API.

Paper resources

Drawing figures: sane_tikz

Debugging models on the GPU

If a model raises an error on the GPU, the error message is often very vague. Running the same model on the CPU will usually produce a much more specific error message.
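
For example, with PyTorch (the framework is an assumption; the note above does not name one), temporarily moving the model and a failing batch to the CPU often yields a much more precise stack trace:

import torch

def debug_on_cpu(model, batch):
    # CUDA kernels run asynchronously, so GPU errors are often reported far from the
    # offending line; the CPU path fails exactly at the operation that is broken.
    model_cpu = model.to('cpu')
    batch_cpu = batch.to('cpu')
    return model_cpu(batch_cpu)

# Alternatively, running with the environment variable CUDA_LAUNCH_BLOCKING=1 makes
# GPU kernel launches synchronous, which also helps localize the error on the GPU side.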

Operations research solver benchmarks

Organized by Hans Mittelmann:
[1]. http://plato.asu.edu/bench.html
[2]. http://plato.asu.edu/ftp/milp.html
![image](images_source/_010_/or-sovler.png)