When you want to continuously train a classifier such as Naive Bayes or an SVM, the question arises of how to select and extract features from the training documents.
For feature selection, the common approach seems to be to compute, for each word appearing in the target documents,
Mutual Information
Chi-square
Information Gain
and adopt the terms with the highest scores. But what method is effective when, for example, you want to assign a category to each new news article as it arrives and keep training the classifier incrementally?
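For reference, here is a minimal batch-style sketch of the usual scoring step with scikit-learn; the documents, labels and the choice of chi-square are only illustrative, and it does not answer the incremental-training question above.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["stocks fall as the economy slows", "the team wins the cup final",
        "the central bank cuts interest rates", "the striker scores twice"]
labels = ["business", "sports", "business", "sports"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)                 # document-term count matrix
selector = SelectKBest(chi2, k=5).fit(X, labels)   # score every term, keep the top 5

terms = vectorizer.get_feature_names_out()         # get_feature_names() on older scikit-learn
for i in selector.get_support(indices=True):
    print(terms[i], selector.scores_[i])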
References:
A comparison of feature selection methods for an evolving RSS feed corpus
http://www.scit.wlv.ac.uk/~cm1993/papers/comparison_feature_selection.pdf
Feature Selection: An Ever Evolving Frontier in Data Mining
http://jmlr.csail.mit.edu/proceedings/papers/v10/liu10b/liu10b.pdf
Thursday, June 21, 2012
How to improve the accuracy of morphological analysis
MeCab, ChaSen, and the Yahoo! morphological analysis API are well known for Japanese; this is a memo on how to raise the accuracy further.
Extracting Wikipedia headwords to update the MeCab dictionary
I see, so that's one way to do it. There are probably other clever approaches too, such as making good use of Hatena Keywords or Twitter trends.
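The linked article's exact steps aren't reproduced here, but a rough sketch of the idea might look like this: write each extracted title into a MeCab user-dictionary CSV and compile it with mecab-dict-index. The file names, the fixed cost of 1000, and the dictionary paths are assumptions, not the article's values.

# -*- coding: utf-8 -*-
# Rough sketch: turn a list of Wikipedia article titles (one per line in the
# hypothetical file wikipedia_titles.txt) into a MeCab user-dictionary CSV.
with open('wikipedia_titles.txt') as src, open('user_dic.csv', 'w') as dst:
    for line in src:
        word = line.strip()
        if not word:
            continue
        # surface,left-id,right-id,cost,POS fields...,base form,reading,pronunciation
        # Leaving the context IDs empty lets mecab-dict-index estimate them.
        dst.write('%s,,,1000,名詞,固有名詞,一般,*,*,*,%s,*,*\n' % (word, word))

# Then compile a binary user dictionary, for example:
#   /usr/lib/mecab/mecab-dict-index -d /usr/lib/mecab/dic/ipadic \
#       -u user.dic -f utf-8 -t utf-8 user_dic.csv
# and point MeCab at it with the -u option (or userdic in mecabrc).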
I'll update this post whenever something else comes to mind.
Wednesday, June 20, 2012
How to use Python + MongoDB
With the PyMongo module, you can use MongoDB from Python very easily.
Something like this:
#!/usr/bin/python
# -*- coding: utf-8 -*-
from pymongo import Connection

conn = Connection('localhost')
# conn = Connection('xxx.xxx.xxx.xxx', 27017)

# db - collection - {key1:value1, key2:value2, "_id":ObjectId("~")}
db = conn['test']
col = db['entry']

# Insert 10 objects
# This may create duplicate entries
for i in range(0, 10):
    col.insert({'num1': i, 'num2': i*i})

# Print all
for data in col.find():
    print data

# Print where 'num1' is 5
for data in col.find({'num1': 5}):
    print data

# Delete something
col.remove({'num1': 5})

# Update something
target = col.find_one({'num1': 9})
target['num1'] = 99
col.save(target)

for data in col.find():
    print data

conn.disconnect()
Thursday, June 7, 2012
Which algorithm should I use for classification?
There are many machine learning algorithms out there; this is a memo from investigating which ones to use for classification and clustering.
http://gihyo.jp/dev/serial/01/machine-learning/0015
I found a very helpful Q&A exchange on Google Answers; it is quoted below.
Rocchio's Algorithm

Rocchio's algorithm is based on relevance feedback for document retrieval. Relevance feedback models are an effective way of modifying and expanding user queries (as in search engines), and Rocchio's algorithm is one of the earliest methods of this kind. It is based on the idea that if the relevant documents for a query are known, an optimal query vector will maximize the average query-document similarity for relevant documents while simultaneously minimizing query-document similarity for non-relevant documents. Looking at the basic formula, the intuitive idea is to iteratively increase the weights of terms contained in labeled relevant documents while penalizing terms contained in irrelevant documents. Recent studies have shown that Rocchio's algorithm performs poorly when the proportion of relevant documents in the whole corpus is low.

http://www.cs.uml.edu/~kajal/courses/91.580-S03/papers/Hoashi.pdf
http://www-student.cse.buffalo.edu/~arao/CSE741/slides.html
http://ifsc.ualr.edu/xwxu/publications/ecir03_hybrid.pdf
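As a reading aid (not part of the quoted answer), here is a minimal NumPy sketch of the Rocchio update just described; the weights alpha, beta and gamma are the usual empirical constants, and the vectors are assumed to be pre-computed tf-idf document vectors.

import numpy as np

def rocchio_update(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    # Move the query vector toward the centroid of relevant documents
    # and away from the centroid of non-relevant documents.
    q = alpha * query
    if len(relevant):
        q += beta * np.mean(relevant, axis=0)
    if len(non_relevant):
        q -= gamma * np.mean(non_relevant, axis=0)
    return q

# Toy example with 3-term vectors
query = np.array([1.0, 0.0, 0.0])
relevant = np.array([[0.9, 0.8, 0.0], [1.0, 0.6, 0.1]])
non_relevant = np.array([[0.0, 0.1, 0.9]])
print(rocchio_update(query, relevant, non_relevant))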
Naive Bayes

Naive Bayes algorithms are among the most successful known algorithms for learning to classify text documents. The classifier reads a set of examples in attribute-value representation and then uses Bayes' theorem to estimate the posterior probability of each classification; for each instance, the classification with the highest posterior probability is chosen as the prediction.

Example: suppose your data consist of fruits, described by their color and shape. A Bayesian classifier operates by saying: "If you see a fruit that is red and round, which type of fruit is it most likely to be, based on the observed data sample? In the future, classify red and round fruit as that type of fruit."

A difficulty arises when you have more than a few variables and classes: you would require an enormous number of observations (records) to estimate these probabilities. Naive Bayes classification gets around this problem by not requiring lots of observations for each possible combination of the variables. Rather, the variables are assumed to be independent of one another, so the probability that a fruit that is red, round, firm, 3" in diameter, etc. will be an apple can be calculated from the independent probabilities that a fruit is red, that it is round, that it is firm, that it is 3" in diameter, and so on. In other words, Naive Bayes classifiers assume that the effect of a variable value on a given class is independent of the values of the other variables. This assumption is called class conditional independence; it is made to simplify the computation and in this sense is considered "naive". It is a fairly strong assumption and is often not applicable. However, bias in estimating probabilities often makes no difference in practice: it is the order of the probabilities, not their exact values, that determines the classification. Studies comparing classification algorithms have found the Naive Bayes classifier to be comparable in performance with classification trees and neural network classifiers; it has also exhibited high accuracy and speed when applied to large databases.

The problem with multinomial Naive Bayes is that when one class has more training examples than another, it selects poor weights for the decision boundary. This is due to an under-studied bias effect that shrinks weights for classes with few training examples. Another systematic problem is the assumption that features are independent: even when words are dependent, each word contributes evidence individually, so the magnitude of the weights for classes with strong word dependencies is larger than for classes with weak word dependencies.

http://www.wikipedia.org/wiki/Naive_Bayesian_classification
http://www.dmg.org/pmmlspecs_v2/NaiveBayes.html
http://www.cs.iastate.edu/~patterbj/cs/cs572/algs/naive.html
http://www.resample.com/xlminer/help/NaiveBC/classiNB_intro.htm

Problems with Naive Bayes:
http://www.ai.mit.edu/people/jrennie/papers/icml03-nb.pdf
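To make the fruit example concrete, here is a toy sketch (my own illustration, with made-up data): estimate per-class feature probabilities from observed fruits and pick the class with the highest naive Bayes score.

from collections import Counter, defaultdict

# (color, shape) -> fruit label; purely illustrative data
training = [
    (('red', 'round'), 'apple'), (('red', 'round'), 'apple'),
    (('yellow', 'long'), 'banana'), (('red', 'round'), 'cherry'),
]

class_counts = Counter(label for _, label in training)
feature_counts = defaultdict(Counter)
for features, label in training:
    feature_counts[label].update(features)
vocab = set(f for features, _ in training for f in features)

def nb_score(features, label):
    # P(class) times the product of P(feature | class), with add-one smoothing
    p = class_counts[label] / float(len(training))
    total = sum(feature_counts[label].values())
    for f in features:
        p *= (feature_counts[label][f] + 1.0) / (total + len(vocab))
    return p

# "If you see a fruit that is red and round, which fruit is it most likely to be?"
print(max(class_counts, key=lambda label: nb_score(('red', 'round'), label)))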
K-Nearest Neighbor

The goal of this method is simply to separate the data based on the assumed similarities between the various classes; the classes can thus be differentiated from one another by searching for similarities between the data provided. K-Nearest Neighbor is suitable for data streams. KNN does not build a classifier in advance. When a new sample arrives, KNN finds the K neighbors nearest to the new sample in the training space, based on some suitable similarity or distance metric. KNN is a good choice when simplicity and accuracy are the predominant issues. It can be superior when a resident, trained and tested classifier has a short useful lifespan, as with data streams where new data is added rapidly and the training set is ever changing. KNN does not rely on prior probabilities, and it is computationally efficient; the main computation is the sorting of the training documents in order to find the K nearest neighbors for the test document. K-Nearest Neighbor is useful when there are fewer than 20 attributes per instance, there is plenty of training data, training must be very fast, complex target functions are to be learned, and information should not be lost. The disadvantages are that answering queries is slow and that irrelevant attributes can fool the neighbor search.

http://cs.hbg.psu.edu/~ding/publications/PAKDD02_KNN.pdf
http://wwwcsif.cs.ucdavis.edu/~liaoy/research/text_ss02_html/node4.html
http://www.cs.tufts.edu/~emower/KNN.html
http://www4.cs.umanitoba.ca/~jacky/Teaching/Courses/74.436/current/Lectures/L07_Instance_Based_Learning.pdf
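A minimal text-flavored k-NN sketch in the spirit of the description above (my own illustration with invented documents, using scikit-learn for tf-idf and cosine similarity): no classifier is built in advance, and a new document simply takes the majority label of its k most similar training documents.

from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_docs = ["stocks fall as the economy slows", "the team wins the cup final",
              "the central bank cuts interest rates", "the striker scores twice"]
train_labels = ["business", "sports", "business", "sports"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_docs)

def knn_predict(text, k=3):
    # Similarity of the new document to every training document
    sims = cosine_similarity(vectorizer.transform([text]), X)[0]
    nearest = sims.argsort()[::-1][:k]   # indices of the k nearest neighbors
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

print(knn_predict("bank raises interest rates as the economy improves"))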
Decision Tree

The Decision Tree exploration engine, new to PolyAnalyst 4.1, helps solve the task of classifying cases into multiple categories. Decision Tree is PolyAnalyst's fastest algorithm when dealing with large numbers of attributes. The Decision Tree report provides an easily interpreted decision tree diagram and a predicted-versus-real table.

Problems to solve: classification of cases into multiple categories.
Target attributes: categorical or Boolean (yes/no) attribute.
Output format: classification statistics, predicted-versus-real table (confusion matrix), decision tree diagram.
Optimal number of records: minimum of 100 records, maximum of 5,000,000 records.
Preprocessing suggested: summary statistics, to deselect attributes that contain too many values to provide any useful insight to the exploration engine.
Underlying algorithms: information gain splitting criteria, Shannon information theory and statistical significance tests.

The data used: Decision Tree works on data of any type. The DT algorithm is well poised for analyzing very large databases because it does not require loading all the data into main memory simultaneously. PolyAnalyst takes full advantage of this feature by implementing incremental DT learning with the help of the OLE DB for Data Mining mechanism. The DT algorithm's calculation time scales very well (grows only linearly) with an increasing number of data columns. At the same time, it grows more than linearly with the number of data records, as N*log(N), where N is the number of records. Yet, for data of about 100,000 records, the DT algorithm is often the fastest exploration algorithm of PolyAnalyst.

Problems to solve: the Decision Tree algorithm helps solve the task of classifying cases into multiple categories. In many cases, this is the fastest, as well as the most easily interpreted, machine learning algorithm of PolyAnalyst. The DT algorithm provides intuitive rules for a great variety of classification tasks, ranging from predicting buyers/non-buyers in database marketing, to automatically diagnosing patients in medicine, to determining customer attrition causes in banking and insurance.

Target attribute: the target attribute of a Decision Tree exploration must be of a Boolean (yes/no) or categorical data type.

When to use this algorithm: the Decision Tree exploration engine is used for tasks such as classifying records or predicting outcomes. You should use decision trees when your goal is to assign your records to a few broad categories. Decision trees provide easily understood rules that can help you identify the best fields for further exploration.

The output: the Decision Tree report starts off by giving measures resulting from the decision tree: the number of non-terminal nodes, the number of leaves, and the depth of the constructed tree. Next, the report provides classification statistics on the decision tree.

Problems: the drawback of this algorithm is that a large number of Gini indices have to be computed at each node of the decision tree. In order to decide which attribute to split at each node, the Gini indices have to be computed for all the attributes and for each successive pair of values for all patterns which have not yet been classified.

Overview with an example:
http://www.bandmservices.com/DecisionTrees.htm
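PolyAnalyst itself is not shown here; as a stand-in, the following small sketch trains a decision tree with an information-gain style ("entropy") splitting criterion using scikit-learn on invented data (export_text assumes a reasonably recent scikit-learn release).

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy categorical data encoded as 0/1 flags: [is_red, is_round]
X = [[1, 1], [1, 1], [0, 0], [1, 0]]
y = ["apple", "apple", "banana", "other"]

clf = DecisionTreeClassifier(criterion="entropy")   # information-gain style splits
clf.fit(X, y)
print(export_text(clf, feature_names=["is_red", "is_round"]))
print(clf.predict([[1, 1]]))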
Support Vector Machine

A support vector machine is a supervised learning algorithm developed over the past decade by Vapnik and others (Vapnik, Statistical Learning Theory, 1998). The algorithm addresses the general problem of learning to discriminate between positive and negative members of a given class of n-dimensional vectors. For example, if you have a series of mRNA expression level measurements for each of a large number of genes, the SVM can learn to answer questions such as "Does the given gene belong to functional class X?", where X is some category such as "ribosomal genes" or "sugar and carbohydrate transporters". If you have a collection of digitized images of handwritten digits, the SVM can say whether a given image is, say, the number 9.

The SVM algorithm operates by mapping the given training set into a possibly high-dimensional feature space and attempting to locate in that space a plane that separates the positive from the negative examples. Having found such a plane, the SVM can then predict the classification of an unlabeled example by mapping it into the feature space and asking on which side of the separating plane the example lies. Much of the SVM's power comes from its criterion for selecting a separating plane when many candidate planes exist: the SVM chooses the plane that maintains a maximum margin from any point in the training set. Statistical learning theory suggests that, for some classes of well-behaved data, the choice of the maximum-margin hyperplane will lead to maximal generalization when predicting the classification of previously unseen examples (Vapnik, Statistical Learning Theory, 1998). The SVM algorithm can also be extended to cope with noise in the training set and with multiple classes (Cristianini and Shawe-Taylor, An Introduction to Support Vector Machines, 2000).

Say that we have a training data set containing n examples, each of which is a vector of m numbers. These vectors may be thought of as points in an m-dimensional space. In theory, a simple way to build a binary classifier is to construct a hyperplane (i.e., a plane in a space with more than three dimensions) separating class members (positive examples) from non-members (negative examples) in this space. Unfortunately, most real-world problems involve non-separable data for which there does not exist a hyperplane that successfully separates the positive from the negative examples. One solution to the inseparability problem is to map the data into a higher-dimensional space and define a separating hyperplane there. This higher-dimensional space is called the feature space, as opposed to the input space occupied by the training examples. With an appropriately chosen feature space of sufficient dimensionality, any consistent training set can be made separable. However, translating the training set into a higher-dimensional space incurs both computational and learning-theoretic costs. Furthermore, artificially separating the data in this way exposes the learning system to the risk of finding trivial solutions that overfit the data.

SVMs elegantly sidestep both difficulties. They avoid overfitting by choosing the maximum-margin separating hyperplane from among the many that can separate the positive from negative examples in the feature space. Also, the decision function for classifying points with respect to the hyperplane only involves dot products between points in the feature space. Because the algorithm that finds a separating hyperplane in the feature space can be stated entirely in terms of vectors in the input space and dot products in the feature space, a support vector machine can locate the hyperplane without ever representing the space explicitly, simply by defining a function, called a kernel function, that plays the role of the dot product in the feature space. This technique avoids the computational burden of explicitly representing the feature vectors.

http://www-ai.cs.uni-dortmund.de/SOFTWARE/SVM_LIGHT/svm_light_v3.02.eng.html
http://www.afia.polytechnique.fr/CAFE/ECML01/SVM.html
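For illustration only (not part of the quoted answer), here is a compact linear-kernel SVM text classifier with scikit-learn; the documents and labels are invented.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

docs = ["ribosomal protein synthesis", "sugar transporter gene",
        "ribosome assembly factor", "carbohydrate transport pathway"]
labels = ["ribosomal", "transporter", "ribosomal", "transporter"]

# tf-idf features followed by a maximum-margin linear classifier
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)
print(model.predict(["putative sugar transport protein"]))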
Latent Semantic Analysis

Latent Semantic Analysis (LSA) analyzes word-word, word-passage, and passage-passage relationships. There is a good correspondence between what LSA extracts and what people understand. LSA does not use first-hand knowledge of the world, but extracts meaning from "episodes" of word content. LSA does not use word order or logical relationships; it uses unitary expressions of meaning instead of relationships between successive words. A word is a kind of average of its meaning across all passages. Dimensionality is reduced to approximate human cognition. LSA is a theory of knowledge representation; dimensionality reduction addresses the problem of "insufficient evidence" or the "poverty of the stimulus". LSA uses a matrix decomposition algorithm to reduce the dimensionality. Ideally, the dimension of the reconstruction equals that of the passages. Results show that the meaning similarities are close to those of humans, that LSA's rate of knowledge acquisition approximates that of humans, and that LSA depends on the dimensionality. LSA can be used to test theories of human cognition. LSA skips over the order of words to capture relationships in word choice, and uses a pre-processing step for word correlation over many passages and contexts. LSA uses a very large number of relationships; theories of human cognition cannot be settled by theoretical and philosophical ideas alone.

LSA is an automatic mathematical algorithm for extracting relationships in word usage across passages. It does not use dictionaries, external knowledge, or grammar rules. First, represent words as a matrix, where each row is a word and each column is a text passage or context. Next, do a preliminary transformation of the matrix, transforming each word frequency. LSA then applies singular value decomposition (SVD) to the matrix, which is a form of factor analysis; the decomposition results in dimensionality reduction. In short: extract words from the passages into a word matrix, do a linear decomposition of the matrix, then reduce the dimensionality of the matrix. The LSA matrix adds words not in the passage, analogous to minor human knowledge acquisition. LSA is intuitively sensible, with roughly a three-fourths gain in total comprehension vocabulary inferred from knowledge about words not in the passage or paragraph. Human children have a rapid growth of vocabulary and knowledge, and humans draw conclusions from missing data. Reducing the dimensionality of the representation is useful when the representation matches the data; the data should not be perfectly regenerated. The similarity measure after dimensionality reduction is the cosine between vectors.

Before the SVD is computed, LSA does a preprocessing transformation of the matrix data: take the log word frequency and the entropy for each row and column, weighting each word occurrence by an estimate of its importance in the passage. Knowing a word provides information about the passage it appeared in. Such matrix transformations are used in information retrieval and in models of human cognition. A web site provides LSA-based word or passage vectors, and similarities between words and words, words and passages, and passages and passages. LSA is able to model human conceptual knowledge, and it links information retrieval and human semantic memory. Latent Semantic Indexing (LSI), like LSA, was tested against pre-examined documents; direct comparisons were muddied by the preprocessing of words. LSA handles synonym tests, since most near neighbors are related by the cosine. LSA vectors were created from many passages; LSA captures synonymy through knowledge of the captured vocabularies. The TOEFL vocabulary test simulates human performance on word choice; LSA errors were compared to student errors, and the role of dimension reduction was analyzed. LSA simulates word sorting and word relationships. Subject-matter knowledge and semantic priming are captured in LSA, as are predictive learning and text comprehension for humans.

http://www.uni-koblenz.de/~fruit/publications/Obst99c.html
http://www.cs.toronto.edu/~psy/lsa.pdf
http://www.cs.nmsu.edu/~mmartin/LSA_Intro_AI_Seminar.ppt

Resource:
http://citeseer.nj.nec.com/563891.html
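A short, illustrative LSA sketch (my own, with made-up passages): build a tf-idf term-passage matrix, reduce its dimensionality with a truncated SVD, and compare passages by the cosine in the reduced space.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

passages = ["the cat sat on the mat", "a cat lay on a rug",
            "stock prices fell sharply", "the market dropped today"]

X = TfidfVectorizer().fit_transform(passages)         # term-passage matrix
lsa = TruncatedSVD(n_components=2).fit_transform(X)   # SVD-based dimensionality reduction
print(cosine_similarity(lsa[:1], lsa))                # passage 0 vs. all passages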
Voted Classification

http://216.239.39.104/search?q=cache:CcXO0h6c-Z4J:robotics.stanford.edu/users/ronnyk/vote.pdf+voted+classification+algorithm&hl=en&ie=UTF-8

DOCUMENTS & PAPERS

http://research.microsoft.com/~sdumais/cikm98.pdf
Compares the effectiveness of five different text algorithms (Rocchio's algorithm, decision trees, Naive Bayes, Bayes nets and support vector machines) in terms of learning speed, real-time classification speed and classification accuracy.

http://citeseer.nj.nec.com/lewis91evaluating.html
Discusses a variety of ways of evaluating the effectiveness of text categorization systems.

http://faure.iei.pi.cnr.it/~fabrizio/Publications/ACMCS02.pdf
A very detailed study of machine learning in automated text classification: different algorithms and issues pertaining to document representation, classifier construction and classifier evaluation.

http://classes.seattleu.edu/computer_science/csse470/Madani/ABCs.html
Describes some of the basics of text categorization and how to assess the quality of each method; it surveys machine learning text classification algorithms and measures their efficiency in order to predict the best classification method.

http://www.pace.edu/library/pages/pdf/saleeb_thesis.pdf
A thesis by a doctoral candidate at Pace University. It not only clearly defines different sets of text categorization algorithms but also compares and contrasts them on different datasets, to help system designers choose an appropriate algorithm given their system constraints and application domains (see page 37 of the thesis, which is page 47 in Acrobat Reader).

http://www.ldv.uni-trier.de:8080/ldvpage/naumann/textklassifikation/Textklassifikation/yang99reexamination.pdf
Reports a controlled study, with statistical significance tests, on five text categorization methods and their robustness in dealing with different situations.

http://citeseer.nj.nec.com/update/90507
Focuses on a comparative evaluation of a wide range of text categorization methods, including previously published results on the Reuters corpus and new results from additional experiments. A controlled study using three classifiers (kNN, LLSF and WORD) was conducted to examine the impact of configuration variations in five versions of Reuters on the observed performance of the classifiers. Analysis and empirical evidence suggest that the evaluation results on some versions of Reuters were significantly affected by the inclusion of a large portion of unlabelled documents, making those results difficult to interpret and leading to considerable confusion in the literature. Using the results evaluated on the other versions of Reuters, which exclude the unlabelled documents, the performance of twelve methods is compared directly or indirectly. For indirect comparisons, kNN, LLSF and WORD were used as baselines, since they were evaluated on all versions of Reuters that exclude the unlabelled documents. As a global observation, kNN, LLSF and the neural network method had the best performance; except for a Naive Bayes approach, the other algorithms performed relatively well.

MINOR RESOURCES

http://216.239.53.104/search?q=cache:BmQikVKWy4AJ:www.ee.usyd.edu.au/~kenw/Thesis.pdf+compare+text+categorization+algorithms&hl=en&ie=UTF-8
Describes four machine learning techniques common in text categorization and then evaluates their performance.

http://www.ics.uci.edu/~pazzani/Yiming.ppt
PowerPoint presentation of five text classifiers (read the conclusion).

Clarification of Answer by leader-ga on 05 Aug 2003 08:05 PDT

I have researched several articles and compiled a summary of the advantages and disadvantages of the models. I was unable to find any article that specifically compares each model and then shows, for various situations, which is best.

SUMMARY

Vector Space Model
The vector space model is the most popular model and is claimed to be the best-performing model for general document collections.
Advantages and disadvantages:
(+) The model is relatively easy to understand.
(+) The model can be applied without modification to new collections.
(+) The model gives very good retrieval quality in the SMART environment.
(-) It is not (yet) clear what happens if the model is applied to documents that are longer and more complicated than the short documents (abstracts) for which it was developed.
(-) There is no sound theoretical basis for the assumption that documents that resemble the query are relevant to the question.

Rocchio's Algorithm
One of the earliest algorithms. Its main advantage over the others is that it is easy to learn and implement, although the classification accuracy may not be of the highest quality.
Advantages and disadvantages:
(+) Easy to implement
(+) Very fast learner
(+) Relevance feedback mechanism
(-) Low classification accuracy
(-) Linear combination too simple for classification
(-) The constants are empirical

Latent Semantic Analysis
LSA was originally developed for the task of information retrieval, but in recent years has produced remarkably human-like abilities in a variety of language tasks, and it is the best algorithm for non-English retrieval systems.
Advantages and disadvantages:
(+) Best IR for non-English document retrieval
(-) Disregards syntax
(-) Disregards morphology
(-) Computationally expensive

Decision Tree Model
Decision Tree is the fastest algorithm when dealing with large numbers of attributes, because it depends on a root system and reduces problem complexity.
Advantages and disadvantages:
(+) Easy to understand
(+) Easy to generate rules
(+) Reduces problem complexity
(-) Training time is relatively expensive
(-) A document is only connected with one branch
(-) Once a mistake is made at a higher level, any subtree is wrong
(-) Does not handle continuous variables well
(-) May suffer from overfitting

KNN
Compared to other text categorization methods such as the Bayesian classifier, kNN does not rely on prior probabilities, and it is computationally efficient. The main computation is the sorting of training documents in order to find the k nearest neighbors for the test document.
Advantages and disadvantages:
(+) Conceptually simple, yet able to form complex decision boundaries
(+) Can work with relatively little information
(+) Combines naturally with human problem-solving
(+) Learning is simple
(-) Classification time is long
(-) Difficult to find the optimal value of k

Naive Bayes
The Naive Bayes classifier is a reasonable approach to many user modeling problems, given its advantages of quick learning and low computational overhead.
Advantages and disadvantages:
(+) Works well on numeric and textual data
(+) Easy to implement and cheap to compute compared with other algorithms
(-) The conditional independence assumption is violated by real-world data; performs very poorly when features are highly correlated
(-) Does not consider the frequency of word occurrences

Voted Classification
One of the most effective algorithms, as it uses weights to calculate the relevancy of the classified text.
Advantages and disadvantages:
(+) Surprisingly effective
(+) Robust to noise
(+) Decreases the overfitting effect
(-) Requires more calculation and memory

RESOURCES

Naive Bayes advantages and uses:
http://citeseer.nj.nec.com/stern99naive.html

Rocchio's Algorithm:
http://216.239.37.104/search?q=cache:3n2c2zmaWiMJ:www.ai.mit.edu/people/jimmylin/papers/Arampatzis00.ps+rocchio%27s+algorithm+&hl=en&ie=UTF-8
http://www.citeseer.nj.nec.com/context/1362225/339280
http://www.ai.mit.edu/projects/jmlr/papers/v3/crammer03b.html

Latent Semantic Analysis:
http://citeseer.nj.nec.com/290200.html
http://216.239.39.104/search?q=cache:1_rfIhM8sGsJ:internal.autotutor.org/papers/howlate.pdf+Latent+Semantic+Analysis+advantages&hl=en&ie=UTF-8
http://www.citeseer.nj.nec.com/context/884350/98549

Decision Tree Model:
http://www.megaputer.com/products/pa/algorithms/dt.php3

K-Nearest Neighbor:
http://wwwcsif.cs.ucdavis.edu/~liaoy/research/text_ss02_html/node4.html
http://www.cs.umd.edu/~brabec/quadtree/nearest.html
http://www.mathforum.org/~ken/categorize/007.html
http://www.kiew.cs.uni-dortmund.de:8001/mlnet/instances/81d91e93-df4da3c279

Vector Space Model:
http://pi0959.kub.nl:2080/Paai/Onderw/Smart/Origdoc/ir_models.html

Additional papers (from a later clarification by leader-ga); the first four compare different text categorization methods:

A re-examination of text categorization methods
http://citeseer.nj.nec.com/cache/papers/cs/26885/http:zSzzSzranger.uta.eduzSz~alpzSzixzSzreadingszSzYangSigir99CategorizationBenchmark.pdf/yang99reexamination.pdf
Summary: This paper reports a controlled study with statistical significance tests on five text categorization methods: Support Vector Machines (SVM), a k-Nearest Neighbor (kNN) classifier, a neural network (NNet) approach, Linear Least-Squares Fit (LLSF) mapping, and a Naive Bayes (NB) classifier. We focus on the robustness of these methods in dealing with a skewed category distribution, and on their performance as a function of the training-set category frequency. Our results show that SVM, kNN and LLSF significantly outperform NNet and NB when the number of positive training instances per category is small (less than ten), and that all the methods perform comparably when the categories are sufficiently common (over 300 instances).

A comparison of several classification algorithms for word- and sense-based text categorization
http://users.auth.gr/~kehagiat/KehPub/journal/2001WordsSenses.pdf

An evaluation of statistical approaches to text categorization
http://citeseer.nj.nec.com/cache/papers/cs/1951/http:zSzzSzwww.cs.cmu.eduzSz~yimingzSzpapers.yyzSzcmu-cs-97-127.pdf/yang97evaluation.pdf
Summary: This paper is a comparative study of text categorization methods. Fourteen methods are investigated, based on previously published results and newly obtained results from additional experiments. Corpus biases in commonly used document collections are examined using the performance of three classifiers. Problems in previously published experiments are analyzed, and the results of flawed experiments are excluded from the cross-method evaluation. As a result, eleven of the fourteen methods remain. A k-nearest neighbor (kNN) classifier was chosen as the performance baseline on several collections; on each collection, the performance scores of the other methods were normalized using the score of kNN. This provides a common basis for a global observation on methods whose results are only available on individual collections. Widrow-Hoff, k-nearest neighbor, neural networks and the Linear Least-Squares Fit mapping are the top-performing classifiers, while the Rocchio approaches had rela...

Feature Preparation in Text Classification
http://technet.oracle.com/products/text/pdf/feature_preparation.pdf
Summary: We report text categorization accuracy for different types of features and different types of feature weights. The comparison of these classifiers shows that stemmed or un-stemmed single words as features give better classifier performance compared with other types of features.

Other useful papers:

Summarization as feature selection for text categorization
http://portal.acm.org/citation.cfm?id=502647&coll=portal&dl=ACM&ret=1#Fulltext
Summary: The paper addresses the problem of evaluating the effectiveness of summarization techniques for the task of document categorization. It is argued that, for a large class of automatic categorization algorithms, extraction-based document categorization can be viewed as a particular form of feature selection performed on the full text of the document and, in this context, its impact can be compared with state-of-the-art feature selection techniques especially devised to provide good categorization performance. Such a framework provides a better assessment of the expected performance of a categorizer if the compression rate of the summarizer is known.

Improving text categorization methods for event tracking
http://citeseer.ist.psu.edu/cache/papers/cs/13898/http:zSzzSzwww.cs.cmu.eduzSz~yimingzSzpapers.yyzSzsigir00.pdf/yang00improving.pdf/
Summary: Automated tracking of events from chronologically ordered document streams is a new challenge for statistical text classification. Existing learning techniques must be adapted or improved in order to effectively handle difficult situations where the number of positive training instances per event is extremely small, the majority of training documents are unlabelled, and most of the events have a short duration in time. We adapted several supervised text categorization methods, specifically several new variants of the k-Nearest Neighbor (kNN) algorithm and a Rocchio approach, to track events. All of these methods showed significant improvement (up to a 71% reduction in weighted error rates) over the performance of the original kNN algorithm on TDT benchmark collections, making kNN among the top-performing systems in the recent TDT3 official evaluation. Furthermore, by combining these methods, we significantly reduced the variance in performance of our event tracking system over different ...

SVM outperforms other text categorization methods for Web searches
http://www.cs.cornell.edu/people/tj/publications/joachims_98a.pdf
Summary: This paper explores the use of Support Vector Machines (SVMs) for learning text classifiers from examples. It analyzes the particular properties of learning with text data and identifies why SVMs are appropriate for this task. Empirical results support the theoretical findings: SVMs achieve substantial improvements over the currently best-performing methods and behave robustly over a variety of different learning tasks. Furthermore, they are fully automatic, eliminating the need for manual parameter tuning.

Using text categorization techniques for intrusion detection (kNN)
http://216.239.37.104/search?q=cache:Kn8PR6oeYyAJ:www.cs.ucdavis.edu/~vemuri/papers/text_ss02.pdf+applying+text+categorization+methods&hl=en&ie=UTF-8

SVM ADVANTAGES
Reduced computational complexity due to the discrete kernel function
Faster training times
Good classification accuracy
Fast to classify new instances

SVM DISADVANTAGES
There is currently no on-line learning version of the SVM; the SVM is normally trained using a batch processing method.
Predictions from an SVM are not probabilistic: in regression the SVM outputs a point estimate, and in classification a "hard" binary decision.
The initial SVM model was designed to separate two classes; there was not originally an SVM designed for multi-class separation.
The requirement to estimate a trade-off parameter and the need to use Mercer kernel functions.

REFERENCES
http://www.cs.ucsd.edu/~rik/cogsci200-s01/dumaisSVM.htm
http://www.citeseer.nj.nec.com/context/1093570/0
http://www.cs.unr.edu/~bebis/MathMethods/SVM/lecture.pdf
http://www.support-vector.net/
kiew.cs.uni-dortmund.de:8001/mlnet/instances/81d91eaa-da14013dc6
citeseer.nj.nec.com/saunders98support.html