Classification Engine

You can run our classification engine locally; see Local Classification Server for downloads and licensing.

Behind the scenes, our classification engine runs in the Amazon cloud. Here is some technical information about it.

Specification

  • Runs as a Windows service
  • Communicates with XML over sockets
  • Multi-purpose
  • Highly accurate
  • Low memory footprint
  • Very fast
  • Parallel request handling
  • Robust
  • Transactional behavior
  • Probabilities [0-1] in the result

Architecture

The classification server runs as a Microsoft Windows service driven by XML calls over sockets (which makes it easy to integrate with other operating systems). Responses are also returned in XML. The API is very similar to that of our free web API.
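
To make the request/response flow concrete, here is a minimal client sketch in Python. The host, port and XML element names (classify, text, the MySpamFilter classifier) are placeholders for illustration only, not the server's actual protocol; see the Local Classification Server documentation for the real request format.

    import socket

    HOST = "localhost"   # assumed: server running on the same machine
    PORT = 54321         # assumed: whatever port the service is configured to use

    # Hypothetical request shape, for illustration only.
    request = (
        '<?xml version="1.0" encoding="utf-8"?>'
        "<classify classifier='MySpamFilter'>"
        "<text>Cheap watches, buy now!</text>"
        "</classify>"
    )

    with socket.create_connection((HOST, PORT)) as sock:
        sock.sendall(request.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)        # signal that the request is complete
        response = b""
        while chunk := sock.recv(4096):      # read the XML response until EOF
            response += chunk

    print(response.decode("utf-8"))          # per-class probabilities come back as XML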

Flexibility

The classifier has no limitations on what it can classify: it can be used for spam filtering, sentiment analysis, web page categorization and so on. There is no limit on how many classes a classifier can have. Classifiers are thread-safe (guarded by a readers/writers lock), so classes and new training data can be added dynamically while the server is running.
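
To illustrate the concurrency model, here is a minimal readers/writers lock sketch in Python: many classification requests (readers) can run at the same time, while a training call (writer) briefly takes exclusive access. This shows the pattern only; the server's actual C++ implementation may differ.

    import threading

    class ReadWriteLock:
        def __init__(self):
            self._cond = threading.Condition()
            self._readers = 0
            self._writer = False

        def acquire_read(self):
            with self._cond:
                while self._writer:                  # wait until no writer holds the lock
                    self._cond.wait()
                self._readers += 1

        def release_read(self):
            with self._cond:
                self._readers -= 1
                if self._readers == 0:
                    self._cond.notify_all()

        def acquire_write(self):
            with self._cond:
                while self._writer or self._readers:  # wait for exclusive access
                    self._cond.wait()
                self._writer = True

        def release_write(self):
            with self._cond:
                self._writer = False
                self._cond.notify_all()

    lock = ReadWriteLock()

    def classify(text):                # many of these can run concurrently
        lock.acquire_read()
        try:
            ...                        # score the text against each class
        finally:
            lock.release_read()

    def train(class_name, text):       # takes exclusive access while updating counts
        lock.acquire_write()
        try:
            ...                        # update word counts for class_name
        finally:
            lock.release_write()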

Low on resources

The server, developed in C++, is designed to handle huge amounts of data without compromising accuracy. It can keep many large classifiers in memory simultaneously and still respond quickly to classification requests.

To give you an idea of how fast it is: we recently ran a test on a modern PC, batching blog posts (2.4 kB on average) through 5 large classifiers. The throughput was over 100 posts/second, including the communication overhead. That is more than 360,000 posts/hour, on a single core.

It also handles multiple requests in parallel, which is very useful if you have multiple cores!
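
As a sketch of how a client could take advantage of that, the snippet below fans classification requests out over a small thread pool. classify_one is a hypothetical stand-in for the XML round trip shown under Architecture.

    from concurrent.futures import ThreadPoolExecutor

    def classify_one(post: str) -> str:
        # Stand-in for one XML request/response round trip over a socket.
        return "<response/>"

    posts = ["first blog post ...", "second blog post ...", "third blog post ..."]

    with ThreadPoolExecutor(max_workers=4) as pool:   # roughly one worker per core
        responses = list(pool.map(classify_one, posts))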

Stability

We have spent a lot of time making it robust, ensuring that it won't crash or misbehave under any circumstances. Even if the host machine runs out of memory (and has no page file), the server survives and returns proper error messages.

It uses transactional behavior to ensure that classifiers are not left in an undefined state if a write operation unexpectedly fails. For example, if the server runs out of memory while training a class, the training is reverted and an error message is returned.
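
The idea can be sketched as a stage-then-commit pattern: build the updated class data on the side and only swap it in once the whole training step has succeeded. This is an illustration of the behavior, not the server's actual implementation.

    import copy

    def train_class(classifier: dict, class_name: str, documents: list) -> None:
        # Stage the update on a copy of the class data.
        staged = copy.deepcopy(classifier.get(class_name, {}))
        try:
            for doc in documents:
                for word in doc.split():
                    staged[word] = staged.get(word, 0) + 1   # this allocation may fail
        except MemoryError:
            # Nothing has been committed, so the classifier keeps its old state.
            raise RuntimeError("training failed: out of memory, class left unchanged")
        classifier[class_name] = staged   # commit only after the whole step succeeded

    classifier = {"spam": {"cheap": 3}}
    train_class(classifier, "spam", ["cheap watches buy now"])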

Algorithm

The core is a multinomial Naive Bayesian classifier with a couple of additional steps that improve the classification further (hybrid complementary NB, class normalization and special smoothing). The result of a classification is a probability in [0, 1] for the document belonging to each class. This is very useful if you want to set a threshold for classifications, e.g. treating every document that scores over 90% on the spam class as spam. Using this model also makes it very scalable in terms of CPU time for both classification and training.
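
As a rough illustration of why the output is a probability per class, here is a plain multinomial Naive Bayes sketch with add-one smoothing. It normalizes the per-class scores so they sum to 1 and applies the 90% spam threshold mentioned above. It omits the hybrid complementary and special smoothing steps, so it is not the engine's exact algorithm.

    import math
    from collections import Counter

    training = {
        "spam": ["cheap watches buy now", "win money now"],
        "ham":  ["meeting moved to monday", "see you at lunch"],
    }

    # Per-class word counts and the shared vocabulary.
    word_counts = {c: Counter(w for doc in docs for w in doc.split())
                   for c, docs in training.items()}
    vocab = {w for counts in word_counts.values() for w in counts}

    def classify(text: str) -> dict:
        log_scores = {}
        for c, counts in word_counts.items():
            total = sum(counts.values())
            score = math.log(1.0 / len(training))            # uniform class prior
            for w in text.split():
                # add-one smoothing so unseen words never zero out a class
                score += math.log((counts[w] + 1) / (total + len(vocab)))
            log_scores[c] = score
        # Normalize the per-class scores into probabilities that sum to 1.
        top = max(log_scores.values())
        exps = {c: math.exp(s - top) for c, s in log_scores.items()}
        z = sum(exps.values())
        return {c: e / z for c, e in exps.items()}

    probabilities = classify("buy cheap watches now")
    is_spam = probabilities["spam"] > 0.9    # the 90% threshold from the text
    print(probabilities, is_spam)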