メディアテクノロジー＆デザイン

2012年7月11日水曜日

文章のコサイン類似度

文章間の類似度を測るには色々な方法があるが、ここではbag-of-wordsなる単語ベクトル同士のコサイン類似度(cosine similarity)を計算するカンタンな例を記しておく。

まずコサイン類似度であるが、要はベクトル同士のなす角の大きさをコサインで表現し、1に近ければなす角度が小さいので2つのベクトルはお互いに似ているし、0に近ければぜんぜん似てない、ということである。

bag-of-wordsというのは形態素解析された（単語に分割された）文に含まれる単語を要素とするベクトルである。例えば、以下の文章の場合は、

文1：「人生で起こることは、すべて、皿の上でも起こる」 (三谷幸喜)

文ベクトル1 = ("人生", "で", "起こる", "こと", "は", "、", "すべて", "、", "皿", "の", "上", "で", "も", "起こる", "。")

といった具合である。実際には「で」とか「は」とか「の」とか「、」とかこういう類のものはあまり参考にならないことが多いので排除して

v1 = ("人生", "起こる", "こと", "すべて", "皿", "上")

といった形にしてから類似度を計算するのが好ましいだろう。(単語の選定はtf-idf等を参考にしたりする)
さて、次の文章と先の文章のコサイン類似度を計算することを考える。

文2：「もうすぐ、あなたの人生の上に起こる、驚くべきこと」

v2 = ("もうすぐ", "あなた", "人生", "上", "起こる", "驚く", "べき", "こと")

次に、文ベクトル1と文ベクトル2は次元が違うので、次元をそろえた表現ベクトルを用意する。例えば両文に登場した単語("人生", "起こる", "こと", "すべて", "皿", "上", "もうすぐ", "あなた", "驚く", "べき")の有無を1or0で表現したベクトルを作る。

v1' = (1, 1, 1, 1, 1, 1, 0, 0, 0, 0)

v2' = (1, 1, 1, 0, 0, 1, 1, 1, 1, 1)

このとき

Cosine Similarity = v1'・v2' / (|v1'|*|v2'|) = 4 / (sqrt(6) * sqrt(8)

= 0.58

とざっくり計算できる。この方法では類似度は相対的に判断するしかなさそうだが、0.5を越えればいくらか文章間に共通性はありそうである。かなり簡易な指標だが、重要度に応じて単語に重み付けを行う等、他の技と組み合わせるとより威力を発揮する。

コサイン類似度を返す超シンプルpythonコード例：

#!/usr/bin/python
# -*- coding: utf-8 -*-

import mecabsplitter as splitter
import math, sys

def cossimilarity(s1, s2):
        w1 = splitter.wordsvector(s1)
        w2 = splitter.wordsvector(s2)
        intersec = set(w1).intersection(set(w2))
        return len(intersec)/math.sqrt(len(w1))/math.sqrt(len(w2))

if __name__ == '__main__':
        argvs = sys.argv
        argc = len(argvs)
        if (argc < 2):
                print 'Script needs at least 2 sentences for maximum cosine similarity calculation'
        
        s0 = argvs[1].decode('utf-8')
        maxcossim = 0
        for arg in argvs[2:]:
                s = arg.decode('utf-8')        
                cossim = cossimilarity(s0, s)
                print cossim
                if (cossim > maxcossim):
                        maxcossim = cossim
                        
        print maxcossim

形態素解析ライブラリmecabを使って文章を単語に分割し、特定の種類の単語だけベクトルに入れて返すmecabsplitter.py：

#!/usr/bin/python
# -*- coding: utf-8 -*-

#http://mecab.googlecode.com/svn/trunk/mecab/doc/posid.html

import MeCab

filteredPOS = [10, 11, 12, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]

def wordsvector(sentence):
        sentence = sentence.encode('utf-8')
        words = []
        tagger = MeCab.Tagger('')
        node = tagger.parseToNode(sentence)
        while node:
                # It is possible to filter out words based on posid
                if node.posid in filteredPOS:
                        words.append(node.surface)
                        #print node.surface + ' posid: ' + str(node.posid)
                node = node.next             
        
        return words
        
def featurevector(sentence):
        words = wordsvector(sentence)
        return dict([(w, 1) for w in words])

これをコマンドラインで実行した結果：

$ python cosinesimilarity.py "疲れた脳を職場でリフレッシュする方法" "疲れている脳が職場でリフレッシュする方法を探す目的"
0.816496580928

となり、まぁお互い似てる的なノリになる。
機会があればベクトル空間モデルのdimension reductionについてまた書こう。

2012年7月10日火曜日

マップとハッシュまとめ

マップとハッシュ

マップ

二分木によるツリーマップとハッシュ表によるハッシュマップがある。

ツリーマップ
・データの追加と削除がまぁまぁ高速
・データの検索もまぁまぁ高速

パフォーマンス
ラッキーなとき　O(logN)
アンラッキーなとき　O(N)

C++やJavaのmapクラスがコレ。

ハッシュマップ
・データの追加、削除、検索すごい高速
・メモリ消費量やや多め
・メモリ確保にゆとりを持たないとハッシュ衝突でパフォーマンス低下
・キーに対応するハッシュ値を求め、配列にデータ格納
・データを格納する構造体を指すポインタの配列を確保する必要あり

パフォーマンス
いつもほぼO(1)！

JavaにはHashMap/HashtableクラスあるけどC++のSTLにはない。
ちなみにハッシュ関数 (ハッシュかんすう、hash function) とは、あるデータが与えられた場合にそのデータを代表する数値を得る操作、または、その様な数値を得るための関数のこと。ハッシュ関数から得られた数値のことをハッシュ値または単にハッシュという。
Wikipediaに代表的なハッシュ関数が紹介されている。

2012年6月21日木曜日

進化するドキュメントからの特徴選択

ナイーブベイズ分類器やSVM分類器などを継続的に訓練させたいとき、訓練に使う文書からどのように特徴を選択・抽出したらよいのか、という問題がある。

まず、特徴選択においては、対象の文書に含まれる言葉に対して

相互情報量 (Mutual Information)

χ2乗値 (Chi-square)

情報ゲイン (Information Gain)

を計算して高いものを採用していく方法が一般的なようだが、例えば、新しいニュース記事が入る度にカテゴリを割り当てて分類器を訓練させていきたい場合はどういう手法が有効なのであろうか。

参考）
A comparison of feature selection methods for an evolving RSS feed corpus
http://www.scit.wlv.ac.uk/~cm1993/papers/comparison_feature_selection.pdf

Feature Selection: An Ever Evolving Frontier in Data Mining
http://jmlr.csail.mit.edu/proceedings/papers/v10/liu10b/liu10b.pdf

形態素解析の精度を上げる方法

日本語用のMeCab, Chasen, yahoo!形態素解析APIなどが有名だが、精度をさらに上げるにはどうすればよいかのメモ。

Wikipediaの見出し語を抽出してMeCabの辞書を更新する

なるほど、その手があったか。
他にも「はてなキーワード」やtwitterトレンドなどをうまく使う方法がありそうだ。
思い付いたときにこの記事を更新していこうと思う。

2012年6月20日水曜日

Python + MongoDBの使い方

PyMongoというモジュールを使うとPython上からカンタンにMongoDBが使える。

#!/user/bin/python
# -*- coding: utf-8 -*-

from pymongo import Connection

conn = Connection('localhost')
# conn = Connection('xxx.xxx.xxx.xxx', 27017)

#db - collection - {key1:value1, key2, value2, "_id":ObjcetId("~")}
db = conn['test']

col = db['entry']

# Insert 10 objects
# This may create duplicate entries
for i in range(0,10):
        col.insert({'num1': i, 'num2': i*i})

# Print all
for data in col.find():
        print data

# Print where 'num1' is 5
for data in col.find({'num1': 5}):
        print data
        
# Delete something
col.remove({'num1': 5})
        
# Update something
target = col.find_one({'num1': 9})
target['num1'] = 99
col.save(target)

for data in col.find():
        print data

conn.disconnect()

こんなかんじ。

2012年6月7日木曜日

どのアルゴリズムで分類すればよいか

色々な機械学習アルゴリズムがあるが、分類・クラスタリングを行うにはどれを使えばよいのかに関する調査のメモ。

http://gihyo.jp/dev/serial/01/machine-learning/0015

google Q&A上とても参考になる質疑応答を見つけた。

Rocchioアルゴリズム

Rocchio’s Alogrithem

The Ricchio's Algorithm is based on the Relevancy Feedback Algorithms for Document Relevancy. Relevancy Feedback Models are an effective way of modifying and expanding user queries (such as search engines).

Ricchio's Algorithm is one of the earliest methods used for queries. This algorithm is basesd on the idea that if the relevance for a query is known, an optimal query vector will maximize the average query-document similarity for relevant documents, and will simultaneously minimize query-document similarity for non relevant documents. 

If we look at the basic formula for the Ricchio's Algorithm, the intuitive idea of Ricchio's Algorithm is to iteratively increase the weights of those terms contained in labeled relevant documents while penalizing the terms in the irrelevant documents.

Recent studies have shown that Ricchio's Algorithm has a poor performance when the proportion of relevant documents in the whole corpus is low.

http://www.cs.uml.edu/~kajal/courses/91.580-S03/papers/Hoashi.pdf
http://www-student.cse.buffalo.edu/~arao/CSE741/slides.html
http://ifsc.ualr.edu/xwxu/publications/ecir03_hybrid.pdf

Naive Bayes

Naive Byes algorithms are among the most successful known algorithms for learning to classify text documents. It predicts by reading a set of examples in attribute value-representation and than by using the Bayes Theorem to estimate the posterior probabilities of all qualifications. For each instance of the example language a classification with the highest posterier probability is chosen as the prediction.

Example:

Suppose your data consist of fruits, described by their color and shape. Bayesian classifiers operate by saying "If you see a fruit that is red and round, which type of fruit is it most likely to be, based on the observed data sample? In future, classify red and round fruit as that type of fruit."

A difficulty arises when you have more than a few variables and classes -- you would require an enormous number of observations (records) to estimate these probabilities.

Naive Bayes classification gets around this problem by not requiring that you have lots of observations for each possible combination of the variables. Rather, the variables are assumed to be independent of one another and, therefore the probability that a fruit that is red, round, firm, 3" in diameter, etc. will be an apple can be calculated from the independent probabilities that a fruit is red, that it is round, that it is firm, that is 3" in diameter, etc.

In other words, Naïve Bayes classifiers assume that the effect of an variable value on a given class is independent of the values of other variable. This assumption is called class conditional independence. It is made to simplify the computation and in this sense considered to be “Naïve”.

This assumption is a fairly strong assumption and is often not applicable. However, bias in estimating probabilities often may not make a difference in practice -- it is the order of the probabilities, not their exact values, that determine the classifications.

Studies comparing classification algorithms have found the Naïve Bayesian classifier to be comparable in performance with classification trees and with neural network classifiers. They have also exhibited high accuracy and speed when applied to large databases.

The problem with multinomial Naive Bayes is that when one class has more training examples than another, Naive Bayes selects poor weights for the decision boundary. This is due to an under-studied bias effect that shrinks weights for classes with few training examples. ANother systematic problem with Naive Bayes is that features are assumed to be independent. As a result, even when the words are dependent, each word contributes evidence individually. Thus the magnitude for the weights for classes with strong word dependencies is larger than for classes with weak word dependencies.

http://www.wikipedia.org/wiki/Naive_Bayesian_classification
http://www.dmg.org/pmmlspecs_v2/NaiveBayes.html
http://www.cs.iastate.edu/~patterbj/cs/cs572/algs/naive.html
http://www.resample.com/xlminer/help/NaiveBC/classiNB_intro.htm
PORBLEMS WITH NAIVE BAYES
http://www.ai.mit.edu/people/jrennie/papers/icml03-nb.pdf

K-Nearest Neighbor

The goal of this clustering method is to simply separate the data based on the assumed similarities between various classes. Thus, the classes can be differentiated from one another by searching for similarities between the data provided.

The K-Nearest Neighbor is suitable for data streams. KNN does not build a classifier in advance. When a new sample arrives, KNN finds the K neighbors nearest to the new samples from the training space based on some suitable similarity or distance metric.

KNN is a good choice when simplicity and accuracy are the predominant issues. KNN can be superior when a resident, trained and tested classifiers has a short useful lifespan, such as in the case with the data streams where new data is added rapidly and the training set is ever changing. KNN does not rely on prior probabilities, and it is computationally efficient. The main computation is the sorting of the training documents in order to find out the K nearest neighbors for the test document.

K-Nearest Neighbor is useful when their are less than 20 attributes per instance, there is lots of training data, training is very fast, learning complex target functions and don't want to loose information. The disadvantages of using such a function is that it is slow in sorting out queries and irrelevant attributes can fool the neighbor.

http://cs.hbg.psu.edu/~ding/publications/PAKDD02_KNN.pdf
http://wwwcsif.cs.ucdavis.edu/~liaoy/research/text_ss02_html/node4.html
http://www.cs.tufts.edu/~emower/KNN.html
http://www4.cs.umanitoba.ca/~jacky/Teaching/Courses/74.436/current/Lectures/L07_Instance_Based_Learning.pdf

Decision Tree

The Decision Tree exploration engine, new to PolyAnalyst 4.1, helps solve the task of classifying cases into multiple categories. Decision Tree is PolyAnalyst's fastest algorithm when dealing with large amounts of attributes. Decision Tree report provides an easily interpreted decision tree diagram and a predicted versus real table.

Problems to Solve:

Classification of cases into multiple categories

Target Attributes:

Categorical or Boolean (Yes/No) attribute

Output Format:

Classification statistics

Predicted versus Real table (confusion matrix)

Decision Tree diagram

Optimal Number of Records:

Minimum of 100 records

Maximum of 5,000,000 records

Preprocessing Suggested:

Summary Statistics - to deselect attributes that contain to many
values to provide any useful insight to the exploration engine.

Underlying Algorithms:

Information Gain splitting criteria

Shannon information theory and statistical significance tests.

The Data Used:

Decision Tree works on data of any type. The DT algorithm is
well-poised for analyzing very large databases because it does not
require loading all the data in machine main memory simultaneously.
PolyAnalyst takes a full advantage of this feature by implementing
incremental DT learning with the help of the OLE DB for Data Mining
mechanism.

The DT algorithm calculation time scales very well (grows only
linearly) with increasing number of data columns. At the same time, it
grows more than linearly with the growing number of data records - as
N*log(N), where N is the number of records. Yet, for data of about
100,000 records, the DT algorithm is often the fastest exploration
algorithm of PolyAnalyst.

Problems to Solve:

Decision Tree algorithm helps solving the task of classifying cases
into multiple categories. In many cases, this is the fastest, as well
as easily interpreted machine learning algorithm of PolyAnalyst. The
DT algorithm provides intuitive rules for solving a great variety of
classification tasks ranging from predicting buyers/non-buyers in
database marketing, to automatically diagnosing patient in medicine,
and to determining customer attrition causes
in banking and insurance.

Target Attribute:

The target attribute of a Decision Tree exploration must be of a
Boolean (yes/no) or categorical data type.

When to Use This Algorithm:

The Decision Tree exploration engine is used for task such as
classifying records or predicting outcomes. You should use decision
trees when you goal is to assign your records to a few broad
categories. Decision Trees provide easily understood rules that can
help you identify the best fields for further exploration.

The Output:

The Decision Tree report starts of by giving measures resulting from
the decision tree. These measures are the Number of non-terminal
nodes, Number of leaves, and depth of the constructed tree. Next, the
report provides classification statistics on the decision tree.

Problems:

The drawback of this algorithm is that large number of gini indices
have to be computed at each node of the decision tree. In order to
decide which attribute is to be split at each node, the gini indices
have to be computed for all the attributes and for each successive
pair of values for all patterns which have not been classified

OVERVIEW WITH AN EXAMPLE

http://www.bandmservices.com/DecisionTrees.htm

Support Vector Machine

A support vector machine is a supervised learning algorithm developed over the past decade by Vapnik and others (Vapnik, Statistical Learning Theory, 1998). The algorithm addresses the general problem of learning to discriminate between positive and negative members of a given class of n-dimensional vectors. For example, if you have a series of mRNA expression level measurements for each of a large number of genes, the SVM can learn to answer questions such as, ``Does the given gene belong to functional class X?'' where X is some category such as ``ribosomal genes'' or ``sugar and carbohydrate transporters.'' If you have a collection of digitized images of
handwritten digits, the SVM can say whether a given image is, say, the number 9.

The SVM algorithm operates by mapping the given training set into a
possibly high-dimensional feature space and attempting to locate in
that space a plane that separates the positive from the negative
examples. Having found such a plane, the SVM can then predict the
classification of an unlabeled example by mapping it into the feature
space and asking on which side of the separating plane the example
lies. Much of the SVM's power comes from its criterion for selecting a
separating plane when many candidates planes exist: the SVM chooses
the plane that maintains a maximum margin from any point in the
training set. Statistical learning theory suggests that, for some
classes of well-behaved data, the choice of the maximum margin
hyperplane will lead to maximal generalization when predicting the
classification of previously unseen examples (Vapnik, Statistical
Learning Theory, 1998). The SVM algorithm can also be extended to cope
with noise in the training set and with multiple classes (Cristianini
and Shawe-Taylor, An Introduction to Support Vector Machines, 2000).

Say that we have a training data set containing n examples, each of
which is a vector of m numbers. These vectors may be thought of as
points in an m-dimensional space. In theory, a simple way to build a
binary classifier is to construct a hyperplane (i.e., a plane in a
space with more than three dimensions) separating class members
(positive examples) from non-members (negative examples) in this
space. Unfortunately, most real-world problems involve non-separable
data for which there does not exist a hyperplane that successfully
separates the positive from the negative examples. One solution to the
inseparability problem is to map the data into a higher-dimensional
space and define a separating hyperplane there. This
higher-dimensional space is called the feature space, as opposed to
the input space occupied by the training examples. With an
appropriately chosen feature space of sufficient dimensionality, any
consistent training set can be made separable. However, translating
the training set into a higher-dimensional space incurs both
computational and learning-theoretic costs. Furthermore, artificially
separating the data in this way exposes the learning system to the
risk of finding trivial solutions that overfit the data.

SVMs elegantly sidestep both difficulties. They avoid overfitting by
choosing the maximum margin separating hyperplane from among the many
that can separate the positive from negative examples in the feature
space. Also, the decision function for classifying points with respect
to the hyperplane only involves dot products between points in the
feature space. Because the algorithm that finds a separating
hyperplane in the feature space can be stated entirely in terms of
vectors in the input space and dot products in the feature space, a
support vector machine can locate the hyperplane without ever
representing the space explicitly, simply by defining a function,
called a kernel function, that plays the role of the dot product in
the feature space. This technique avoids the computational burden of
explicitly representing the feature vectors.

http://www-ai.cs.uni-dortmund.de/SOFTWARE/SVM_LIGHT/svm_light_v3.02.eng.html
http://www.afia.polytechnique.fr/CAFE/ECML01/SVM.html

Latent Semantic Analysis

Latent Semantic Analysis (LSA) analyzes word-word, word-passage, and
passage-passage relationships. There's a good relationship between
what LSA extracts and what people understand. LSA doesn't use
first-hand knowledge of the world, but extracts from "episodes" of
word content. LSA doesn't use word order or logic relationships. LSA
uses unitary expressions of meaning instead of relationships between
successive words. A word is a kind of average of meaning through all
passages.

Dimensionality is reduced to approximate human cognition. LSA is a
theory of knowledge representation. Dimensionality reduction solves
the problem of "insufficient evidence" or "poverty of the stimulus".
LSA uses a matrix decomposition algorithm to reduce the
dimensionality. Ideally, the dimension of the reconstruction equals
that of the passages. Results show that the meaning similarities are
close to that of humans, LSA's rate of knowledge acquisition
approximates that of humans, and LSA depends on the dimensionality.

LSA can be use to test theories of human cognition. LSA skips over the
order of words to capture relationships in word choice. LSA uses a
pre-processing step for word correlation over many passages and
contexts. LSA uses a very large number of relationships. Theories of
human cognition cannot be settled by theoretical and philosophical
ideas.

LSA is an automatic mathematical algorithm for extracting
relationships in word usage in passages. It doesn't use dictionaries,
external knowledge, or grammar rules. First, represent words as a
matrix, each row is a word, and each column is a text passage or
context. Next, do a preliminary transformation of the matrix. Each
word frequency in the Next, LSA applies singular value decomposition
(SVD) to the matrix, which is factor analysis. The decomposition
results in dimensionality reduction. Extract words from the passage
into a word matrix, do a linear decomposition of the matrix, then
reduce the dimensionality of the matrix. The LSA matrix adds words not
in the passage, like human minor knowledge acquisition. LSA is
intuitively sensible, with a three-fourths gain in total comprehension
vocabulary inferred from knowledge about words not in the passage or
paragraph. Human children have a rapid growth of vocabulary and
knowledge. Humans draw conclusions from missing data. Reducing the
dimensionality of representation is useful when the representation
matches the data. The data should not be perfectly regenerated. The
similarity of dimensionality reduction is the cosine between vectors.

Before the SVD is computed, LSA does a data preprocessing matrix data
transformation. Save the log word frequency, and the entropy for each
row and column of the word. Weight each word occurrence by an estimate
of its importance in the passage. Knowing a word provides information
about the passage it appeared in. Matrix transformations are used in
information retrieval and human cognition models. A web site provides
LSA based word or passage vectors, similarities between words and
words, words and passages, and passages and passages. LSA is able to
model human conceptual knowledge.
LSA links information retrieval and human semantic memory. Latent
Semantic Indexing (LSI), like LSA, was tested against pre-examined
documents. Direct comparisons were muddied by preprocessing words. LSA
does synonym tests, since most near neighbors are related by the
cosine. LSA vectors were created from many passages. LSA captures
synonymity by knowledge
of captured vocabularies. TOEFL vocabulary simulates human performance
between word choice. LSA errors were compared to student errors. The
role of dimension reduction was analyzed. LSA simulates word sorting
and word relationships. Subject-matter knowledge and sematic priming
are in LSA. Predictive learning and text comprehension for humans.

http://www.uni-koblenz.de/~fruit/publications/Obst99c.html
http://www.cs.toronto.edu/~psy/lsa.pdf
http://www.cs.nmsu.edu/~mmartin/LSA_Intro_AI_Seminar.ppt

RESOURCE

http://citeseer.nj.nec.com/563891.html

Voted Classification

http://216.239.39.104/search?q=cache:CcXO0h6c-Z4J:robotics.stanford.edu/users/ronnyk/vote.pdf+voted+classification+algorithm&hl=en&ie=UTF-8

*********************************************************

************DOCUMENTS & PAPERS***************************

*********************************************************

http://research.microsoft.com/~sdumais/cikm98.pdf

The research compares the effectiveness of five different text
algorithms (Rocchio's Algorithm, Decision Tree, Naive Bayes, Bayes
Nets & Support Vector Machines) in terms of learning speed, real time
classification speed and classification accuracy.

http://citeseer.nj.nec.com/lewis91evaluating.html

The report discusses a variety of ways of evaluating the effectiveness
of text categorization systems.

http://faure.iei.pi.cnr.it/~fabrizio/Publications/ACMCS02.pdf

This is a very detailed study of machine learning in Automated Text
Classification, different algorithms and issue pertaining to document
representation, classifier construction and classifier evaluation.

http://classes.seattleu.edu/computer_science/csse470/Madani/ABCs.html

This document describes some of the basics of text categorization and
how to access the quality of each.

The paper surveys the machine learning text classification algorithms
and measure their efficiency to predict the best classification
method.

http://www.pace.edu/library/pages/pdf/saleeb_thesis.pdf

This is a thesis written by one of the doctorate candidates of the
PAce University. It not only clearly defines different sets of text
categorization algorithms but also compare and contrast them for
different datasets so as to help system designers chose an appropriate
algorithm given their system constraints and application domains. (see
page 37 of the thesis which is Page 47 of the acrobat reader).

http://www.ldv.uni-trier.de:8080/ldvpage/naumann/textklassifikation/Textklassifikation/yang99reexamination.pdf

This paper reports a controlled study with statistical significant
tests on five text categorization methods and their robustness in
dealing with different situations.

http://citeseer.nj.nec.com/update/90507

This paper focuses on a comparative evaluation of a wide-range of text
categorization methods, including previously published results on the
Reuters corpus and new results of additional experiments. A controlled
study using three classifiers, kNN, LLSF and WORD, was conducted to
examine the impact of configuration variations in five versions of
Reuters on the observed performance of classifiers.
Analysis and empirical evidence suggest that the evaluation results on
some versions of Reuters were significantly affected by the inclusion
of a large portion of unlabelled documents, making those results
difficult to interpret and leading to considerable confusions in the
literature. Using the results evaluated on the other versions of
Reuters which exclude the unlabelled documents, the performance of
twelve methods are compared directly or indirectly. For indirect
comparisons, kNN, LLSF and WORD were used as baselines, since they
were evaluated on all versions of Reuters that exclude the unlabelled
documents. As a global observation, kNN, LLSF and neural network
method had the best performance, except for a Naive Bayes approach,
the other algorithms performed relatively well.

****MINOR RESOURCES****

http://216.239.53.104/search?q=cache:BmQikVKWy4AJ:www.ee.usyd.edu.au/~kenw/Thesis.pdf+compare+text+categorization+algorithms&hl=en&ie=UTF-8

The paper describes four machine learning techniques that are common
with the Text categorization and then evaluate their performance
later.

http://www.ics.uci.edu/~pazzani/Yiming.ppt

Power Point presentation of five text classifiers (Read the
conclusion)

Hope this will help you. I will try to post more resources if I come
across any. Please clarify if you have any questions regarding this
research or if you need more help.

Sincerely,
leader-ga

=================

Clarification of Answer by leader-ga on 05 Aug 2003 08:05 PDTSorry for the delay. I have researched several articles and compiled a
summary of the advantages and disadvantages of the models. I am unable
to find any article that specifically relates to the comparison of
each model and than prove for various situations, which is the best
for each. I am still working on it.

Thanks.

SUMMARY

Vector Space Model
The vector space model, is the most popular model and is claimed to be
the best performing model for general document collections.

Advantages and disadvantages:
(+) The model is relatively easy to understand.
(+) The model can be applied without modification on new collections.
(+) The model gives a very good retrieval quality in the SMART
environment.
(-) It is not (yet) clear what happens if the model is applied to
documents that are longer and more complicated than the short
documents (abstracts) for which it was developed.
(-) There is no sound theoretical basis for the assumption that
documents that resemble the query, are relevant for the question.

Rocchio’s Algorithm
It is one of the earliest algorithm. It s main advantage over the
others is that it is easy to learn and implement although the
classification accuracy may not be of the highest quality.

Advantages & Disadvantages
(+)Easy to implement
(+)Very fast learner
(+)Relevance feedback mechanism
(-)Low classification accuracy
(-)Linear combination too simple for classification
(-)Constant and are empirical

Latent Semantic Analysis
LSA was originally developed for the task of information retrieval but
in recent years have produced remarkably human like abilities in a
variety of language tasks and is the best algorithm used for
non-English retrieval systems.

Advantages and Disadvantages
(+) Best IR for non-English document retrieval
(-) Disregard syntax
(-) Disregards morphology
(-) Computationally expensive

Decision Tree Model
Decision Tree is the fastest algorithm when dealing with large amounts
of attributes because it depends on a root system and remove problem
complexity.

Advantages and Disadvantages
(+)Easy to understand
(+)Easy to generate rules
(+)Reduce problem complexity
(+)Training time is relatively expensive
(-)A document is only connected with one branch
(-)Once a mistake is made at a higher level, any sub tree is wrong
(-)Does not handle continuous variable well
(-)May suffer from over fitting

KNN
Compared to other text categorization methods such as Bayesian
classifier, kNN does not rely on prior probabilities, and it is
computationally efficient. The main computation is the sorting of
training documents in order to find the k nearest neighbors for the
test document.

Advantages and Disadvantages
(+)Conceptually simple, yet able to form complex
(+)Decision boundaries
(+)Can work with relatively little information
(+)Combine naturally with human problem-solving
(+)Learning is simple
(-)Classification time is long
(-)Difficult to find optimal value of k

Naïve Bayes
Naive Bayes classifier is a reasonable approach to many user modeling
problems, given its advantages of quick learning and low computational
overhead.

Advantage and Disadvantages
(+)Work well on numeric and textual data
(+)Easy to implement and computation comparing with other algorithms
(-)Conditional independence assumption is violated by real-world data,
perform very poorly when features are highly correlated
(-)Does not consider frequency of word occurrences

Voted Classification
One of the most effective algorithm as it uses weights to calculate
the relevancy of the classified text.

Advantages and Disadvantages
(+)Surprisingly effective
(+)Robust to noise
(+)Decrease the overfitting effect
(-)Require more calculation and memory

RESOURCES

Naïve Bayes Advantages & Uses
http://citeseer.nj.nec.com/stern99naive.html

Rocchio’s Algorithm
http://216.239.37.104/search?q=cache:3n2c2zmaWiMJ:www.ai.mit.edu/people/jimmylin/papers/Arampatzis00.ps+rocchio%27s+algorithm+&hl=en&ie=UTF-8
http://www.citeseer.nj.nec.com/context/1362225/339280
http://www.ai.mit.edu/projects/jmlr/papers/v3/crammer03b.html

Latent Semantic Analysis
http://citeseer.nj.nec.com/290200.html
http://216.239.39.104/search?q=cache:1_rfIhM8sGsJ:internal.autotutor.org/papers/howlate.pdf+Latent+Semantic+Analysis+advantages&hl=en&ie=UTF-8
http://www.citeseer.nj.nec.com/context/884350/98549 - 13k

Decision Tree Model
http://www.megaputer.com/products/pa/algorithms/dt.php3
http://www.megaputer.com/products/pa/algorithms/dt.php3

K-Nearest Neighbor
http://wwwcsif.cs.ucdavis.edu/~liaoy/research/text_ss02_html/node4.html
http://www.cs.umd.edu/~brabec/quadtree/nearest.html
http://www.mathforum.org/~ken/categorize/007.html
http://www.kiew.cs.uni-dortmund.de:8001/mlnet/
instances/81d91e93-df4da3c279

Vector Space Model
http://pi0959.kub.nl:2080/Paai/Onderw/Smart/Origdoc/ir_models.html

Thanks for the clarifications. I am still trying to find the relevant
documents or papers.

Sincerely,
leader-ga

==========

Hello mcsemorgan-ga:

My research revealed some additional papers that you may like. The
first four of these papers compare different text categorization
methods for a certain study. Please let me know if these are the kind
of papers that you will consider. Thanks for being patient with me.
(Although I have checked but If a link doesn’t work, please cut and
paste the address in your browser window).

Following are the papers:

A re-examination of text categorization method
http://citeseer.nj.nec.com/cache/papers/cs/26885/http:zSzzSzranger.uta.eduzSz~alpzSzixzSzreadingszSzYangSigir99CategorizationBenchmark.pdf/yang99reexamination.pdf
SUMMARY
This paper reports a controlled study with statistical significance
tests on five text categorization methods: the Support
Vector Machines (SVM), a k-Nearest Neighbor (kNN) classifier,
a neural network (NNet) approach, the Linear Leastsquares
Fit (LLSF) mapping and a NaiveBayes (NB) classifier.
We focus on the robustness of these methods in dealing
with a skewed category distribution, and their performance
as function of the training-set category frequency. Our results
show that SVM, kNN and LLSF significantly outperform
NNet and NB when the number of positive training
instances per category are small (less than ten), and that all
the methods perform comparably when the categories are
sufficiently common (over 300 instances).

A comparison of several classification algorithms for Word and Sense
based text categorization http://users.auth.gr/~kehagiat/KehPub/
journal/2001WordsSenses.pdf

An evaluation of Statistical approaches to text categorization
http://citeseer.nj.nec.com/cache/papers/cs/1951/http:zSzzSzwww.cs.cmu.eduzSz~yimingzSzpapers.yyzSzcmu-cs-97-127.pdf/yang97evaluation.pdf
SUMMARY
This paper is a comparative study of text categorization methods.
Fourteen methods are investigated, based on
previously published results and newly obtained results from
additional experiments. Corpus biases in commonly
used document collections are examined using the performance of three
classifiers. Problems in previously published
experiments are analyzed, and the results of flawed experiments are
excluded from the cross-method evaluation. As
a result, eleven out of the fourteen methods are remained. A
k-nearest neighbor (kNN) classifier was chosen for
the performance baseline on several collections; on each collection,
the performance scores of other methods were
normalized using the score of kNN. This provides a common basis for a
global observation on methods whose results
are only available on individual collections. Widrow-Hoff, k-nearest
neighbor, neural networks and the Linear Least
Squares Fit mapping are the top-performing classifiers, while the
Rocchio approaches had rela...

Feature Preparation in Text Classification
http://technet.oracle.com/products/text/pdf/feature_preparation.pdf
SUMMARY
We report text categorization accuracy for different types of features
and different types of feature weights. The comparison of these
classifiers shows that stemmed or un-stemmed single words as features
give better classifier performance compared with other types of
features.

***************Other Useful Papers******************

Summarization as feature selection of text categorization
http://portal.acm.org/citation.cfm?id=502647&coll=portal&dl=ACM&ret=1#Fulltext
SUMMARY
The paper addresses the problem of evaluating the effectiveness of
summarization techniques for the task of document categorization. It
is argued that for a large class of automatic categorization
algorithms, extraction-based document categorization can be viewed as
a particular form of feature selection performed on the full text of
the document and, in this context, its impact can be compared with
state-of-the-art feature selection techniques especially devised to
provide good categorization performance. Such a framework provides for
a better assessment of the expected performance of a categorizer if
the compression rate of the summarizer is known

Improving text categorization methods for event tracking
http://citeseer.ist.psu.edu/cache/papers/cs/13898/http:zSzzSzwww.cs.cmu.eduzSz~yimingzSzpapers.yyzSzsigir00.pdf/yang00improving.pdf/
SUMMARY
Automated tracking of events from chronologically ordered
document streams is a new challenge for statistical
text classification. Existing learning techniques
must be adapted or improved in order to effectively
handle difficult situations where the number of positive
training instances per event is extremely small,
the majority of training documents are unlabelled, and
most of the events have a short duration in time. We
adapted several supervised text categorization methods,
specifically several new variants of the k-Nearest Neighbor
(kNN) algorithm and a Rocchio approach, to track
events. All of these methods showed significant improvement
(up to 71% reduction in weighted error rates)
over the performance of the original kNN algorithm on
TDT benchmark collections, making kNN among the
top-performing systems in the recent TDT3 official evaluation.
Furthermore, by combining these methods, we
significantly reduced the variance in performance of our
event tracking system over different ...

SVM out performers other text categorization methods for Web searches
http://www.cs.cornell.edu/people/tj/publications/joachims_98a.pdf
SUMMARY
This paper explores the use of Support Vector Machines SVMs for
learning text classifiers from examples It analyzes the particular
properties of learning with text data and identifies why SVMs are
appropriate for this task Empirical results support the theoretical
findings SVMs achieve substantial improvements over the currently best
performing methods and behave robustly over a variety of different
learning tasks Furthermore they are fully automatic eliminating the
need for manual parameter tuning.

Using text categorization techniques for intrusion detection (KNN)
http://216.239.37.104/search?q=cache:Kn8PR6oeYyAJ:www.cs.ucdavis.edu/~vemuri/papers/text_ss02.pdf+applying+text+categorization+methods&hl=en&ie=UTF-8

Sincerely,
leader-ga

==========

SVM

ADVANTAGES
Reduced computational complexity due to discrete kernel function
Faster training times
Good classification accuracy,
Fast to classify new instances

DISSADVANTAGES
There is currently no on-line learning version of
the SVM. The SVM is trained on data normally
using a batch processing method.

Predictions from a SVM are not
Probabilistic . In regression the SVM outputs a point estimate, and
in classification, a `hard' binary decision.

The initial SVM model was designed to separate
two classes. In other words there wasn't originally
a SVM that was designed for multiple class
separation.

The requirement to estimate a trade-o parameter and the need to
utilize `Mercer' kernel functions.

REFERERENCE
http://www.cs.ucsd.edu/~rik/cogsci200-s01/dumaisSVM.htm
http://www.citeseer.nj.nec.com/context/1093570/0
http://www.cs.unr.edu/~bebis/MathMethods/SVM/lecture.pdf
http://www.support-vector.net/
kiew.cs.uni-dortmund.de:8001/mlnet/ instances/81d91eaa-da14013dc6
citeseer.nj.nec.com/saunders98support.html

Sorry! I had an emergency situation and couldn't reply. Please refer
to the reference papers and you can have a very good idea of the
advantages and disadvantages related to SVM.

Sincerely,
leader-ga

登録: 投稿 (Atom)