A real-world classifier has three components, and we will look at each of them in turn to explain in more detail how a classifier works.

1. The dataset

As we demonstrated above, a statistical method of classification requires a collection of documents that have been manually tagged with their appropriate categories. The quality of this dataset is by far the most important component of a statistical NLP classifier.

The dataset needs to be large enough to have an adequate number of documents in each class. For example, if you wished to classify documents into 500 possible categories, you might require 100 documents per category, for a total of at least 50,000 documents.
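As a quick illustration, one might sanity-check class balance with a few lines of Python; the dataset and the 100-document threshold below are purely hypothetical:

```python
from collections import Counter

# Hypothetical labelled dataset: (document text, category) pairs.
training_data = [
    ("the striker scored a late goal", "sports"),
    ("the central bank raised interest rates", "finance"),
    ("a new transfer deal was agreed", "sports"),
    # ... thousands more tagged documents
]

MIN_DOCS_PER_CLASS = 100  # illustrative threshold, not a hard rule

counts = Counter(label for _, label in training_data)
for label, n in counts.items():
    if n < MIN_DOCS_PER_CLASS:
        print(f"category '{label}' has only {n} documents; consider adding more")
```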

The dataset also needs to be of high enough quality: documents in different categories should be distinct enough from one another to allow clear delineation between the categories.
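One rough way to gauge distinctness, sketched below with made-up categories, is to measure how much the vocabularies of two categories overlap; real systems would use more sophisticated measures, but the idea is the same:

```python
def vocab_overlap(docs_a, docs_b):
    """Jaccard overlap between the vocabularies of two categories.
    Lower overlap suggests the categories are easier to tell apart."""
    vocab_a = {term for doc in docs_a for term in doc.split()}
    vocab_b = {term for doc in docs_b for term in doc.split()}
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)

sports = ["the striker scored in the final", "a late goal won the match"]
finance = ["the bank raised interest rates", "markets fell on the news"]
print(f"overlap: {vocab_overlap(sports, finance):.2f}")  # low value = distinct
```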

2. Preprocessing

In our simple examples, we have given equal importance to each and every word when creating document vectors. During preprocessing, we could instead weight words according to their importance to the document in question. A common method for doing this is TF-IDF (term frequency - inverse document frequency). The TF-IDF weight for a word increases with the number of times the word appears in the document, but decreases with how frequently the word appears across the entire document set. This has the effect of giving a lower overall weight to words which occur frequently throughout the document set, such as “a”, “it”, etc.
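To make this concrete, here is a minimal sketch of one common TF-IDF variant, tf(t, d) * log(N / df(t)); the toy documents are invented, and libraries such as scikit-learn provide production-ready implementations:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each term by tf(t, d) * log(N / df(t)), where df(t) is the
    number of documents containing t and N is the corpus size."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n_docs / df[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]

docs = [
    "a match report on the world cup final".split(),
    "a review of a new soccer video game".split(),
    "a report on interest rates".split(),
]
for vector in tf_idf(docs):
    print(vector)
```

Note that a word like “a”, which appears in every document here, receives a weight of zero, while words distinctive to a single document are weighted highly.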

3. Classification Algorithm and Strategy

In our example above, the algorithm we used to classify our document was very simple. We classified the document by comparing the number of matching terms in the document vectors to see which class it most closely resembled. In reality, we may be placing documents into more than one category type and we may also be assigning multiple labels to a document within a given category type. We may also have a hierarchical structure in our taxonomy, and therefore require a classifier that takes that into account.
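A minimal sketch of that matching-terms approach might look like this; the class names and term sets are invented for the example:

```python
def classify_by_overlap(document, class_vectors):
    """Pick the class whose term set shares the most words with the document."""
    doc_terms = set(document.split())
    return max(class_vectors, key=lambda c: len(doc_terms & class_vectors[c]))

# Hypothetical term sets drawn from each class's training documents.
class_vectors = {
    "sports": {"goal", "match", "striker", "cup"},
    "finance": {"bank", "rates", "markets", "shares"},
}

print(classify_by_overlap("a dramatic goal settled the cup match", class_vectors))
# -> sports
```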

For example, using IPTC International Subject News Codes to assign labels, we may give a document two labels simultaneously, such as “sports event - World Cup” and “sport - soccer”, where “sport” and “sports event” are the root categories and “soccer” and “World Cup” are the child categories.

There are numerous algorithms used in classification, such as Support Vector Machines (SVMs), Naive Bayes, and Decision Trees, the details of which are beyond the scope of this blog.
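That said, to give a flavour of how such an algorithm is used in practice, the sketch below trains a Naive Bayes classifier on TF-IDF features with scikit-learn; the tiny training set is invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training set; a real one would contain thousands of tagged documents.
train_texts = [
    "the striker scored a late goal in the cup final",
    "a last minute penalty won the match",
    "the central bank raised interest rates again",
    "shares fell as markets reacted to the news",
]
train_labels = ["sports", "sports", "finance", "finance"]

# TF-IDF features feeding a Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["the goalkeeper saved a penalty"]))  # likely ['sports']
```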

Conclusion

We hope that you now have a better understanding of the basics of document classification and how it works. As a recap: in supervised methods, a classifier is trained on a manually tagged training dataset and, from then on, is expected to predict the category of any given document. The biggest factor affecting the quality of these predictions is the quality of the training dataset. Keep an eye on our blog for more in the “Text Analysis 101” series.

Src: http://blog.aylien.com/