scikit-learn: 4.2.3. Text feature extraction


http://scikit-learn.org/stable/modules/feature_extraction.html

Section 4.2 covers a lot of material, so text feature extraction gets its own post.


1. The bag of words representation

To represent raw text data as fixed-length numerical feature vectors, scikit-learn provides three steps (a minimal end-to-end sketch follows this list):

tokenizing: assign an integer index id to each token (characters or words; pick the granularity yourself)

counting: count the occurrences of each token in each document

normalizing: normalize/weight the importance of each token according to how often it occurs in the samples/documents.
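
A minimal sketch of the three steps on a made-up two-document corpus; the normalizing step here uses TfidfTransformer, which is covered in section 4 below:

>>> from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
>>> corpus = ['the cat sat on the mat', 'the dog sat']
>>> counts = CountVectorizer().fit_transform(corpus)  # tokenizing + counting
>>> tfidf = TfidfTransformer().fit_transform(counts)  # normalizing / weighting
>>> tfidf.toarray()  # one row per document, one column per vocabulary term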


To restate what a feature and a sample are in this scheme: each individual token occurrence frequency is treated as a feature, and the vector of all the token frequencies for a given document is considered a multivariate sample.


Bag of Words or “Bag of n-grams” representation:

general process (tokenization, counting and normalization) of turning a collection of text documents into numerical feature vectors, while completely ignoring the relative position information of the words in the document.


2. Sparsity

The words in any single document are only a tiny fraction of all the words in the corpus, so the feature vectors are sparse (mostly zeros). To keep storage and computation manageable, scikit-learn uses Python's scipy.sparse package.
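
A quick way to see this, on a made-up corpus (the exact class name printed may vary with the scipy version):

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> X = CountVectorizer().fit_transform(['the cat sat', 'the dog barked at the cat'])
>>> type(X)  # a scipy.sparse matrix, not a dense numpy array
<class 'scipy.sparse...'>
>>> X.nnz  # only the non-zero entries are actually stored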



3. Common vectorizer usage

CountVectorizer implements both tokenizing and counting.

It has many parameters, but the defaults are already sensible and fit most situations; for details see: http://blog.csdn.net/mmc2015/article/details/46866537

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(min_df=1)
>>> vectorizer
CountVectorizer(analyzer=...'word', binary=False, decode_error=...'strict',
        dtype=<... 'numpy.int64'>, encoding=...'utf-8', input=...'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=...'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
The examples here illustrate its usage:

http://blog.csdn.net/mmc2015/article/details/46857887

They cover fit_transform, transform, get_feature_names(), ngram_range=(min, max), vocabulary_.get(), and so on; a quick sketch follows.
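
A minimal sketch of those methods on a made-up two-document corpus:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(min_df=1)
>>> X = vectorizer.fit_transform(['this is the first document',
...                               'this is the second document'])
>>> vectorizer.get_feature_names()  # the sorted vocabulary
['document', 'first', 'is', 'second', 'the', 'this']
>>> vectorizer.vocabulary_.get('document')  # column index of a term
0
>>> vectorizer.transform(['a brand new document']).toarray()  # reuse the fitted vocabulary
array([[1, 0, 0, 0, 0, 0]])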


4. Tf-idf term weighting

This addresses the problem that certain words (e.g. “the”, “a”, “is” in English) occur very frequently yet are not the words we actually care about.

The text.TfidfTransformer class implements this normalization:

>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer()
>>> counts = [[3, 0, 1],
...           [2, 0, 0],
...           [3, 0, 0],
...           [4, 0, 0],
...           [3, 2, 0],
...           [3, 0, 2]]
...
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf                         
<6x3 sparse matrix of type '<... 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse ... format>

>>> tfidf.toarray()                        
array([[ 0.85...,  0.  ...,  0.52...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.55...,  0.83...,  0.  ...],
       [ 0.63...,  0.  ...,  0.77...]])
>>> transformer.idf_  # idf_ holds the idf weights learned during fit
array([ 1. ...,  2.25...,  1.84...])
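
The idf_ values can be checked by hand. Assuming the default smooth_idf=True, TfidfTransformer computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. Here n = 6: the first term occurs in all 6 documents, so idf = ln(7/7) + 1 = 1; the second occurs in 1 document, idf = ln(7/2) + 1 ≈ 2.25; the third occurs in 2 documents, idf = ln(7/3) + 1 ≈ 1.85.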

There is another class called TfidfVectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model:
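
A minimal sketch on a made-up corpus:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vectorizer = TfidfVectorizer(min_df=1)
>>> tfidf = vectorizer.fit_transform(['the cat sat', 'the dog sat', 'the cat ran'])
>>> tfidf.shape  # one row per document, one tf-idf-weighted column per term
(3, 5)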

For binary occurrence features, it is better to set CountVectorizer's binary parameter to True; Bernoulli Naive Bayes is also the more suitable estimator in that case.


5. Decoding text files

Text is made of characters, but files are made of bytes. For scikit-learn to work on files you first have to tell it the file's encoding, and CountVectorizer will then decode them automatically. The default encoding is UTF-8, and the decoded character set is Unicode. If the file you load is not UTF-8 encoded and you have not set the encoding parameter, you will get a UnicodeDecodeError.
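
A sketch of the two relevant parameters ('latin-1' below is just an example; pass whatever encoding your files actually use):

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> v1 = CountVectorizer(encoding='latin-1')      # decode input bytes as Latin-1
>>> v2 = CountVectorizer(decode_error='replace')  # replace undecodable bytes instead of raising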

If you run into decoding errors, try the following:

For example, the following snippet uses chardet (not shipped with scikit-learn, must be installed separately) to figure out the encoding of three texts. It then vectorizes the texts and prints the learned vocabulary. The output is not shown here.

>>> import chardet    
>>> text1 = b"Sei mir gegr\xc3\xbc\xc3\x9ft mein Sauerkraut"
>>> text2 = b"holdselig sind deine Ger\xfcche"
>>> text3 = b"\xff\xfeA\x00u\x00f\x00 \x00F\x00l\x00\xfc\x00g\x00e\x00l\x00n\x00 \x00d\x00e\x00s\x00 \x00G\x00e\x00s\x00a\x00n\x00g\x00e\x00s\x00,\x00 \x00H\x00e\x00r\x00z\x00l\x00i\x00e\x00b\x00c\x00h\x00e\x00n\x00,\x00 \x00t\x00r\x00a\x00g\x00 \x00i\x00c\x00h\x00 \x00d\x00i\x00c\x00h\x00 \x00f\x00o\x00r\x00t\x00"
>>> decoded = [x.decode(chardet.detect(x)['encoding'])
...            for x in (text1, text2, text3)]
>>> v = CountVectorizer().fit(decoded).vocabulary_
>>> for term in v: print(term)

(Depending on the version of chardet, it might get the first one wrong.)



6. Applications and examples

I recommend looking at the third example in particular.

In particular, in a supervised setting it can be successfully combined with fast and scalable linear models to train document classifiers (a minimal sketch follows below).

In an unsupervised setting it can be used to group similar documents together by applying clustering algorithms such as K-means.

Finally, it is possible to discover the main topics of a corpus by relaxing the hard assignment constraint of clustering, for instance by using Non-negative matrix factorization (NMF or NNMF).
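
A minimal sketch of the supervised case, with made-up documents and labels; any scalable linear classifier that accepts sparse input could stand in for SGDClassifier:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.linear_model import SGDClassifier
>>> from sklearn.pipeline import make_pipeline
>>> docs = ['cheap pills buy now', 'meeting agenda attached',
...         'win money fast', 'quarterly report draft']
>>> labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham (hypothetical)
>>> clf = make_pipeline(TfidfVectorizer(), SGDClassifier())
>>> clf = clf.fit(docs, labels)
>>> pred = clf.predict(['buy cheap pills'])  # expected to be array([1]) on this toy data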


7. Limitations of the bag of words representation

Misspellings, word derivations, and word order dependence: misspelled variants (word / wprd / wrod) and derived forms (word / words, arrive / arriving) become unrelated features, and the order of and dependencies between words are lost.


Use n-grams rather than just unigrams (a word-level n-gram sketch follows the char_wb example below). You can also apply the stemming approach mentioned here: http://blog.csdn.net/mmc2015/article/details/46730289

Here is an example, using the char_wb analyzer:

>>> ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2), min_df=1)
>>> counts = ngram_vectorizer.fit_transform(['words', 'wprds'])
>>> ngram_vectorizer.get_feature_names() == (
...     [' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp'])
True
>>> counts.toarray().astype(int)
array([[1, 1, 1, 0, 1, 1, 1, 0],
       [1, 1, 0, 1, 1, 1, 0, 1]])
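
For comparison, a word-level n-gram sketch: the two made-up documents below contain exactly the same unigrams but in a different order, and the bigram features are what tell them apart:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> ngram_vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=1)
>>> counts = ngram_vectorizer.fit_transform(['the word order', 'order the word'])
>>> ngram_vectorizer.get_feature_names()
['order', 'order the', 'the', 'the word', 'word', 'word order']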


The remaining three sections will be filled in when I have time...

8. Vectorizing a large text corpus with the hashing trick


9. Performing out-of-core scaling with HashingVectorizer


10. Customizing the vectorizer classes






Original post: http://blog.csdn.net/mmc2015/article/details/46997379
