WSOM 2005, Paris

Correct and efficient text classification is a major challenge given the rapidly increasing amount of accessible electronic text data. Kohonen networks have been applied to document classification with success comparable to that of other document clustering methods. An important challenge is to devise text similarity metrics that improve the performance of Kohonen networks for text classification by integrating more semantic information into the metric. Here we propose an augmented metric for text similarity that is based on comparing the word consecutiveness graphs of documents. We show that Kohonen networks using the proposed augmented similarity metric perform better than Kohonen networks that compare word frequency vectors using the usual Euclidean distance metric. Our results indicate that word consecutiveness graph comparison brings more semantic information into the text similarity measure, improving text classification performance.
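
As a rough illustration of the two kinds of comparison contrasted above, the following minimal sketch computes a Euclidean distance between word frequency vectors and a simple dissimilarity between word consecutiveness graphs. It rests on several assumptions that are not taken from the paper: the tokenizer, the representation of a graph as a set of directed edges between adjacent words, and the Jaccard-based `edge_dissimilarity` function are all hypothetical choices, and the sketch does not reproduce the augmented metric itself.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def euclidean_frequency_distance(text_a, text_b):
    """Euclidean distance between raw word frequency vectors."""
    freq_a, freq_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    vocabulary = set(freq_a) | set(freq_b)
    # Counter returns 0 for words that do not occur in a document.
    return math.sqrt(sum((freq_a[w] - freq_b[w]) ** 2 for w in vocabulary))


def consecutiveness_edges(text):
    """Directed edges (w1, w2) for every pair of adjacent words."""
    words = tokenize(text)
    return set(zip(words, words[1:]))


def edge_dissimilarity(text_a, text_b):
    """Hypothetical graph comparison: 1 - Jaccard overlap of the edge sets.

    This is an assumption made for illustration, not the paper's metric.
    """
    edges_a = consecutiveness_edges(text_a)
    edges_b = consecutiveness_edges(text_b)
    union = edges_a | edges_b
    if not union:
        return 0.0
    return 1.0 - len(edges_a & edges_b) / len(union)


doc_a = "the dog bites the man"
doc_b = "the man bites the dog"

print(euclidean_frequency_distance(doc_a, doc_b))
print(edge_dissimilarity(doc_a, doc_b))
```

The two example sentences contain exactly the same words with the same counts, so their word frequency vectors coincide and the Euclidean distance is zero. Their consecutiveness edges differ (for instance, "dog bites" versus "man bites"), so the graph-based measure still separates them, which is the kind of word-order information a frequency vector cannot capture.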