
Artificial intelligence has already been under development for a fair portion of the computer era.

Yet to this day (2019-02), so-called strong AI remains out of reach.

For an "intelligence" to stay under human control, we must be able to predict its state several steps ahead: no uncertainty, everything under control. Can that kind of control ever produce intelligence? The explorations of today's AI industry may give us the answer.

If we take inspiration from genetic evolution (crossover and mutation, environmental selection), the development of intelligence is likely to be full of uncertainty, which fundamentally conflicts with the control we demand.

If we truly do need to embrace uncertainty, that is, on first thought, rather frightening.

Personally, I lean toward the view that new intelligence can only arise once uncertainty is allowed in. How should we then think about the problems this raises?
It may help to look at the system that is humanity itself: among humans there are good people and bad people, and among machine intelligences there will be good ones and bad ones. Could the mechanisms designed for humans be adapted, in some form, to machine intelligence? Would that make it acceptable to the existing human system?

The goal: as long as adding machine intelligence to the overall system brings gains that exceed its costs, the mechanism should, on the whole, be able to persist.

Background

As we know, linear algebra has no concept of matrix division.

So the "division" discussed here means element-wise division of two matrices with exactly the same dimensions.

It is similar to np.array / np.array in Python:

import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[2, 3], [1, 4]])
# "/" on two equally shaped arrays divides element by element
print("mat a is: \n {0} \n mat b is:\n {1} \n mat a/b is: \n {2}".format(a, b, a/b))
mat a is: 
 [[1 2]
 [3 4]] 
 mat b is:
 [[2 3]
 [1 4]] 
 mat a/b is: 
 [[0.5        0.66666667]
 [3.         1.        ]]

But if you consult Spark's official documentation, matrix multiplication is implemented via RowMatrix.multiply(DenseMatrix), and addition and subtraction pose no problems either.

There is, however, no method for element-wise division, so we need a small workaround of our own. It turns out to be very simple.
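For reference, a minimal sketch of what multiplication with a local matrix looks like (it assumes a running SparkContext `sc`; the identity matrix is only for illustration):

from pyspark.mllib.linalg import DenseMatrix
from pyspark.mllib.linalg.distributed import RowMatrix

m = RowMatrix(sc.parallelize([[1, 2], [3, 4]]))
identity = DenseMatrix(2, 2, [1, 0, 0, 1])  # values in column-major order
print(m.multiply(identity).rows.collect())  # multiply returns a new RowMatrix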

However, because Spark computes in parallel over RDDs, we cannot pull the data out into, say, an np.array and run the algorithm there: that would require collecting all of the data into the memory of a single node, which is clearly not acceptable.

Note that DenseVector does support division in Spark, so the answer is simple: build the two matrices as two RDDs of DenseVectors, then use RDD1.zip(RDD2) to operate on both RDDs at once and perform the element-wise division.

Constructing the matrices

from pyspark import SparkContext
from pyspark.mllib.linalg.distributed import RowMatrix

sc = SparkContext('local', 'local_test_recommend')

rowsa = sc.parallelize([[1, 2], [3, 4]])  # rows of matrix A
rowsb = sc.parallelize([[2, 3], [1, 4]])  # rows of matrix B
sma = RowMatrix(rowsa)  # RowMatrix stores its rows as DenseVectors
smb = RowMatrix(rowsb)
print(sma.rows.collect())
print(smb.rows.collect())
[DenseVector([1.0, 2.0]), DenseVector([3.0, 4.0])]
[DenseVector([2.0, 3.0]), DenseVector([1.0, 4.0])]

Dividing the matrices

# zip pairs up corresponding rows; DenseVector supports element-wise "/"
sma.rows.zip(smb.rows).map(lambda x: x[0]/x[1]).collect()
[DenseVector([0.5, 0.6667]), DenseVector([3.0, 1.0])]
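If the quotient is needed as a distributed matrix again, the zipped RDD can simply be wrapped back into a RowMatrix. A minimal sketch, reusing sma and smb from above:

quotient_rows = sma.rows.zip(smb.rows).map(lambda x: x[0] / x[1])
quotient = RowMatrix(quotient_rows)  # a distributed RowMatrix again
print(quotient.numRows(), quotient.numCols())

Note that RDD.zip requires both RDDs to have the same number of partitions and the same number of elements per partition; that holds here because both matrices were parallelized the same way.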

Introduction

This article introduces a word-embedding trick for deep learning tasks in NLP: w2vembeddings. It removes unnecessary memory usage and about 95% of the load time, and it greatly improves the manageability and reusability of the word embedding library.

Problem

When performing NLP tasks, Word2Vec-style embeddings are generally used: a word vector matrix is generated for the current task, as below.

import numpy as np

def embedding_matrix_p(embeddings_index, word_index, max_features):
    embed_size = embeddings_index.get('a').shape[0]
    # mean 0, std 0: every entry starts as zero, so unseen words keep the zero vector
    embedding_matrix = np.random.normal(0, 0, (max_features, embed_size))
    for word, i in word_index.items():
        if i >= max_features:
            continue
        embedding_vector = embeddings_index.get(word, np.zeros(embed_size))
        embedding_matrix[i] = embedding_vector
    return embedding_matrix

EMBEDDING_FILE = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'
def get_coefs(word, *arr): return word, np.asarray(arr, dtype='float32')
# parse the whole GloVe text file into a {word: vector} dict
embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE))
embedding_matrix = embedding_matrix_p(embeddings_index, word_index, max_features)

But this has a problem: parsing the file (txt/bin) every time is time-consuming, and not all of the vectors loaded into memory are ever used. A given task may involve only 20,000 distinct words, while word vector files generally contain several million entries, so only a few percent are actually useful. How do we eliminate this unnecessary waste of time and memory?

Solution

After looking at how word2vec is usually deployed, I found roughly two modes:
1. Load into memory from a local file.
2. Serve it via a REST API.
Mode 1 is the one described above; time and memory are wasted.
Mode 2 serves vector lookups as a separate service, which gives a cleaner architecture. But using the global word vector information becomes troublesome, and is it really necessary to stand up such a complicated service?

In my spare time, referencing the embeddings project, I did some further work and built w2vembeddings.
The main idea is to transfer the word vector content into SQLite; in essence, it just changes word2vec's storage format. Since SQLite is a database without a server, it is convenient and efficient, and queries are fast (the write path is never needed except when converting the format for the first time).
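To make the idea concrete, here is a minimal sketch of storing and querying vectors in SQLite. The schema and helper names are my own illustration, not the actual w2vembeddings internals:

import sqlite3
import numpy as np

def build_db(txt_path, db_path):
    # one-time conversion: word2vec text file -> SQLite table
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS emb (word TEXT PRIMARY KEY, vec BLOB)")
    with open(txt_path) as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vec = np.asarray(parts[1:], dtype='float32').tobytes()
            conn.execute("INSERT OR REPLACE INTO emb VALUES (?, ?)", (parts[0], vec))
    conn.commit()
    return conn

def get_vector(conn, word, embed_size=300):
    # point lookup: only the requested word's bytes are read
    row = conn.execute("SELECT vec FROM emb WHERE word = ?", (word,)).fetchone()
    return np.frombuffer(row[0], dtype='float32') if row else np.zeros(embed_size, dtype='float32')

Because the primary key indexes the word column, each lookup touches only one row, which is why load time and memory no longer scale with the size of the full embedding file.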

Main features:
Word vector library management.
On-demand, ready-to-use word vector lookup.

Instructions:

import numpy as np
from w2vembeddings.w2vemb import EMB

emb = EMB(name='tencent', dimensions=200)  # load the word vector library, replacing the original load-from-file step

# extract the word vector matrix on demand
def embedding_matrix_p(emb, word_index, max_features):
    embed_size = len(emb.get_vector('a'))
    embedding_matrix = np.random.normal(0, 0, (max_features, embed_size))  # all zeros (std 0)
    for word, i in word_index.items():
        if i >= max_features:
            continue
        embedding_matrix[i] = np.array(emb.get_vector(word))
    return embedding_matrix

This saves almost all of the unnecessary memory usage and about 95% of the load time.

Problem

Of course, this solution has its own problems. Like the REST API approach, it is not convenient when you need global information about the word vectors (for example, nearest-neighbour queries over the whole vocabulary). If you have such a requirement, gensim is recommended.
This method mainly targets people who frequently experiment with word2vec, as well as small online deployment scenarios.
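For such global queries, a minimal gensim sketch (the file path is illustrative, assuming a word2vec-format text file):

from gensim.models import KeyedVectors

# load once; the full vocabulary stays in memory, enabling global queries
kv = KeyedVectors.load_word2vec_format('tencent_200d.txt', binary=False)
print(kv.most_similar('king', topn=5))  # nearest neighbours over the whole vocabulary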