Natural Language Processing (NLP): 08-03 Word Vectors with word2vec


Pretrained word vectors: Word+Character, 300d

Download: https://github.com/Embedding/Chinese-Word-Vectors

gensim is a convenient NLP toolkit.

Word vectors

  • Loading word vectors
  • Getting the vector of a single word and of a sentence
  • Comparing similar texts

Loading the word vectors

To load word vectors with gensim, the first line of the vector file must contain the total number of words followed by the vector dimension.
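
As a quick sanity check, that header can be inspected by printing the file's first line (a minimal sketch; the path is the same file loaded below):

# Print the header line of the vector file; for the pretrained
# Chinese-Word-Vectors files it should read "<vocab_size> <dim>".
with open('../data/news/sgns.sogou.char', encoding='utf-8') as f:
    print(f.readline().strip())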

import gensim

# Path to the pretrained Sogou News word+character vectors (300d)
PRE_WORD_VECTOR = '../data/news/sgns.sogou.char'
# The file is in plain-text word2vec format, hence binary=False
model = gensim.models.KeyedVectors.load_word2vec_format(PRE_WORD_VECTOR, binary=False)

model

Finding the most similar words for a given word

help(model.most_similar)
Help on method most_similar in module gensim.models.keyedvectors:

most_similar(positive=None, negative=None, topn=10, restrict_vocab=None, indexer=None) method of gensim.models.keyedvectors.Word2VecKeyedVectors instance
    Find the top-N most similar words. Positive words contribute positively towards the
    similarity, negative words negatively.
    
    This method computes cosine similarity between a simple mean of the projection
    weight vectors of the given words and the vectors for each word in the model.
    The method corresponds to the `word-analogy` and `distance` scripts in the original
    word2vec implementation.
    
    Parameters
    ----------
    positive : :obj: `list` of :obj: `str`
        List of words that contribute positively.
    negative : :obj: `list` of :obj: `str`
        List of words that contribute negatively.
    topn : int
        Number of top-N similar words to return.
    restrict_vocab : int
        Optional integer which limits the range of vectors which
        are searched for most-similar values. For example, restrict_vocab=10000 would
        only check the first 10000 word vectors in the vocabulary order. (This may be
        meaningful if you've sorted the vocabulary by descending frequency.)
    
    Returns
    -------
    :obj: `list` of :obj: `tuple`
        Returns a list of tuples (word, similarity)
    
    Examples
    --------
    >>> trained_model.most_similar(positive=['woman', 'king'], negative=['man'])
    [('queen', 0.50882536), ...]
model.most_similar('北京大学')
[('北大', 0.6751739978790283),
 ('中国北京大学', 0.6405676603317261),
 ('北京大学经济系', 0.6353614330291748),
 ('北京大学化学系', 0.6258565187454224),
 ('北京大学经济学院', 0.6239113211631775),
 ('清华大学', 0.623389720916748),
 ('北京大学数学系', 0.6190596222877502),
 ('北京联合大学', 0.6075736880302429),
 ('北京大学国家发展研究院', 0.6050190329551697),
 ('北京大学社会学系', 0.6039434671401978)]
model.most_similar('清华大学')
[('中国清华大学', 0.6473081111907959),
 ('清华大学研究院', 0.64178466796875),
 ('清华', 0.6361725330352783),
 ('北京大学', 0.623389720916748),
 ('清华学堂', 0.6214977502822876),
 ('清华大学环境学院', 0.6187784075737),
 ('来清华', 0.6136924028396606),
 ('清华大学电机系', 0.608849823474884),
 ('清华大学热能工程系', 0.6088232398033142),
 ('清华大学出版社', 0.5952858924865723)]

Getting a word vector

Having obtained the vector of a single word, we can also build a vector for a whole sentence and then measure text similarity with cosine similarity (a short example with model.similarity follows the vector printout below).

# model is already a KeyedVectors instance, so it can be indexed directly
print(len(model['北京大学']))   # vector dimension
print(model['北京大学'])
300
[ 0.222128  0.400037 -0.227236  0.154221 -0.16757  -0.029402  0.149328
  0.208378  0.026903 -0.019063  0.639548  0.288518 -0.155013  0.212193
 -0.364441  0.15234  -0.352696  0.369139  0.224686 -0.252323 -0.060435
  0.023563 -0.416039  0.175716  0.549988 -0.227344 -0.036248 -0.214194
 -0.210722  0.35745   0.161434 -0.315777 -0.482192  0.068977  0.259561
  0.559155 -0.328687 -0.003965 -0.034043  0.488579 -0.401134 -0.398123
 -0.216614  0.501402 -0.005595  0.66042  -0.265738 -0.446967  0.12192
  0.407698 -0.102871  0.161026  0.00762   0.416902 -0.254512  0.352373
 -0.642371  0.109026 -0.176015  0.871983  0.384889 -0.368025  0.149876
 -0.221868  0.416928 -0.248114 -0.158976 -0.455642 -0.109791  0.037019
 -0.466224  0.229097  0.150966 -0.638334 -0.101739  0.508677 -0.27648
 -0.420191  0.394759  0.057595 -0.043209  0.37829   0.281436 -0.51135
  0.464562 -0.65453   0.397757  0.139709  0.022303  0.17439  -0.447864
  0.029916 -0.166727 -0.142454  0.549698  0.006929  0.188701 -0.266575
  0.12619   0.618239 -0.346054  0.645234 -0.070742 -0.633241 -0.267914
  0.432474 -0.698169  0.82743   0.521025 -0.440382 -0.323184 -0.059138
 -0.13159  -0.005437 -0.051833  0.082998  0.150343 -0.156679  0.404294
 -0.448309 -0.153925  0.0341   -0.224543 -0.167912  0.124436 -0.139225
 -0.60137  -0.129434 -0.193239  0.428865 -0.522703  0.197434 -0.050462
  0.291615 -0.347122  0.078326 -0.15315  -0.576885  0.231202 -0.24723
 -0.544315  0.035712  0.057589 -0.358502 -0.024013 -0.043832  0.346528
  0.286627  0.095926 -0.963243  0.009543 -0.107851 -0.454459 -0.452275
 -0.360582  0.45673  -0.080686 -0.156356  0.412355 -0.031405 -0.093725
  0.029103 -0.375729  0.252378 -0.051546 -0.020617  0.283126  0.119412
  0.585655 -0.243526 -0.122636  0.16906   0.305912  0.070114 -0.225493
 -0.223918 -0.194622  0.460624  0.175633  0.436071 -0.034496 -0.202896
 -0.093507 -0.357434  0.206491  0.117626  0.407604  0.186866 -0.071872
 -0.313433 -0.409627 -0.169461 -0.170832 -0.399961 -0.542373  0.404423
 -0.224193  0.380582 -0.097363  0.259113  0.068973  0.524166 -0.25709
  0.492447  0.470238 -0.428237 -0.029203 -0.267129 -0.184672  0.042356
  0.117669  0.480988  0.386758  0.027932  0.005613  0.274978  0.33153
  0.20524   0.17237  -0.526942  0.003541  0.141401 -0.360294  0.053148
  0.013686  0.2059   -0.013035 -0.378749  0.206178  0.049989 -0.012331
  0.006176 -0.146461  0.255146 -0.351347  0.208241 -0.255842 -0.305561
 -0.227642  0.284109 -0.172163  0.083946  0.156488 -0.023492 -0.077335
 -0.095375 -0.102046 -0.14019  -0.408847 -0.20965  -0.192427 -0.425838
  0.2257    0.311977 -0.103679 -0.115884  0.499276  0.244835  0.395972
 -0.148485 -0.03555   0.260034  0.200699 -0.027907  0.060458  0.133048
  0.02941  -0.585752  0.591908  0.390191  0.109852 -0.235079  0.023691
 -0.49689   0.007575  0.293473  0.300401  0.106064  0.571459 -0.097116
 -0.057596 -0.139595  0.030312  0.742784 -0.013247 -0.11443   0.51364
  0.005405  0.465761 -0.127208  0.561319  0.416485  0.076532 -0.272079
 -0.350165  0.348433 -0.416548 -0.142728 -0.368308  0.004861]
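
As mentioned above, the similarity of two word vectors is their cosine similarity; gensim exposes this directly through similarity():

# Cosine similarity between the vectors of two words
print(model.similarity('北京大学', '清华大学'))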


Expanding a short text with related words

Given a sentence, if it is fairly short we can expand it with related words so that it carries more information. The loop below prints the nearest neighbours of each test word; a small expansion helper is sketched after the output.

# Print the 10 nearest neighbours of each test word
testwords = ['金融', '股票', '经济']
for w in testwords:
    print('*' * 100)
    print(w)
    print(model.most_similar(w))
****************************************************************************************************
金融
[('金融服务', 0.6928197145462036), ('金融业', 0.6735764741897583), ('金融市场', 0.6495444774627686), ('金融贸易', 0.6441975831985474), ('金融卡', 0.6256759166717529), ('金融性', 0.622938871383667), ('金融界', 0.6226274967193604), ('金融机构', 0.6183921098709106), ('万金融', 0.6088610887527466), ('金融行业', 0.6075006723403931)]
****************************************************************************************************
股票
[('公司股票', 0.674756646156311), ('股票数', 0.6717764139175415), ('持股', 0.6144402027130127), ('型基金', 0.6022541522979736), ('股票投资', 0.598275363445282), ('类股票', 0.59478759765625), ('股价', 0.5928256511688232), ('股', 0.5917819738388062), ('蓝筹股', 0.5868805646896362), ('股票买卖', 0.5836142897605896)]
****************************************************************************************************
经济
[('经济发展', 0.7015445232391357), ('宏观经济', 0.6505186557769775), ('亚太经济', 0.6405400037765503), ('经济社会', 0.6297432780265808), ('总体经济', 0.6239001750946045), ('经济运行', 0.617388129234314), ('国民经济', 0.6091005206108093), ('繁荣经济', 0.6075688600540161), ('矿业经济', 0.6033905744552612), ('乡镇经济', 0.6002935171127319)]
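
Building on the neighbours printed above, a hypothetical helper (a sketch; expand_tokens and its topn parameter are names chosen here, not part of gensim) could collect those neighbours into an expanded token list:

def expand_tokens(tokens, model, topn=3):
    # Keep the original tokens and append the top-N neighbours of each
    # in-vocabulary token, so a short text carries more signal.
    expanded = list(tokens)
    for w in tokens:
        if w in model:
            expanded.extend(neighbour for neighbour, _ in model.most_similar(w, topn=topn))
    return expanded

print(expand_tokens(['金融', '股票'], model))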

Sentence similarity

Sentence -> word/character segmentation -> sum or average of the token vectors -> vector representation of the sentence; the cosine similarity of two such vectors then measures how similar the sentences are.
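
A minimal sketch of this pipeline, assuming jieba as the tokenizer (any Chinese segmenter would do); sentence_vector and cos_sim are helper names chosen here for illustration:

import numpy as np
import jieba  # assumed tokenizer, not part of gensim

def sentence_vector(sentence, model):
    # Average the vectors of the sentence's in-vocabulary tokens.
    tokens = [t for t in jieba.lcut(sentence) if t in model]
    if not tokens:
        return np.zeros(model.vector_size)
    return np.mean([model[t] for t in tokens], axis=0)

def cos_sim(v1, v2):
    # Cosine similarity between two sentence vectors.
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(np.dot(v1, v2) / denom) if denom else 0.0

s1 = '北京大学是中国著名的高等学府'
s2 = '清华大学也是一所著名的大学'
print(cos_sim(sentence_vector(s1, model), sentence_vector(s2, model)))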
