Two tools were integrated to provide the ability to supply HTML documents and run machine learning algorithms over them. The effectiveness of each machine learning algorithm is used as a basis to determine its accuracy across a different yet similar taxonomy. For example, the ODP (Open Directory Project) provides a taxonomy that has been populated with resources by volunteer classifiers. Similarly, CNN has established its own taxonomy. Given these two example taxonomies, it is quite likely that some overlap exists; for example, both have a Sports category containing subcategories such as Baseball, Football, and Basketball, to name just a few.
A series of machine learning algorithms will be evaluated to determine their ability to be trained with resources from a subset of one taxonomy A that is determined, via human expertise, to overlap some subset of a second taxonomy B. Classification accuracy will then be based on the ability of the algorithms to properly classify resources from B into their corresponding categories in A. In addition, classifier committees will be evaluated to determine whether, and by what factor, classification accuracy can be improved.
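As a rough illustration of the committee idea, the simplest combination rule is a majority vote over the labels emitted by the individual classifiers. The sketch below computes such a vote with standard shell tools; the three labels are hypothetical stand-ins for the output of three trained classifiers on a single document.

```shell
# Hypothetical labels assigned to one document by three classifiers;
# the committee's decision is the most frequent label (majority vote).
printf 'Sports\nSports\nWorld\n' |
  sort | uniq -c | sort -rn |    # count the votes, most frequent first
  head -n 1 | awk '{ print $2 }' # keep only the winning label
# prints: Sports
```

More sophisticated committees would weight each vote by the classifier's confidence or its measured accuracy, but the majority vote is the baseline against which such schemes are usually compared.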
Rainbow is an executable program that performs document classification. While designed primarily for classification by naive Bayes, it also provides TFIDF/Rocchio, Probabilistic Indexing, and K-nearest-neighbor methods.
Rainbow's underlying classification code comes from the Bow library. Bow (or libbow) is a library of C code useful for writing statistical text analysis, language modeling, and information retrieval programs. The current distribution includes the library as well as front-ends for document classification (rainbow), document retrieval (arrow), and document clustering (crossbow). For more information about obtaining the source and citing its use, see the Bow home page.
Rainbow first indexes a file by turning the file's stream of characters into tokens, a process called tokenization or "lexing". It supports several options for tokenizing text; for example, the --skip-headers (or -h) option causes rainbow to skip newsgroup or email headers before beginning tokenization. Once indexing is performed and a model has been archived to disk, rainbow can perform document classification. Statistics from a set of training documents determine the parameters of the classifier; classification results for a set of testing documents are then output.
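A typical two-step session might look like the following sketch. It assumes the rainbow binary is installed and on the path; the corpus and model directory names are hypothetical, and only the --skip-headers option is documented above, so the exact spelling of the other options should be checked against the local installation's --help output.

```shell
# Step 1: index training documents, one directory per category
# (directory names hypothetical). --skip-headers strips newsgroup/email
# headers before tokenization; -d names the directory where the model
# is archived to disk.
rainbow -d ./model --skip-headers --index corpus/Baseball corpus/Football

# Step 2: with the model archived in ./model, classify a new document;
# rainbow reports a score for each category in the model.
rainbow -d ./model --query=unseen/article.html
```

The separation of the two steps matters for the evaluation described earlier: the model can be trained once on resources from taxonomy A and then queried repeatedly with resources drawn from taxonomy B.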
Learning algorithms used:
GNU Wget is a freely available network utility that retrieves files from the World Wide Web using HTTP and FTP. This tool was integrated into our project to download files for training, testing, and classification. It has many useful features: