Using machine learning to cluster more than a million articles.

Blendle is a Dutch online news platform that hosts many newspapers and magazines. It receives approximately 8000 articles per day from hundreds of sources, and these articles often describe the same real-life events from different perspectives. Combine this daily stream with the existing database of more than 2.5 million articles and you get an enormous, constantly growing dataset.

Consider the problem of structuring 2.5 million articles plus a daily addition of 8000 more. Done by hand, this takes ages and the results are highly subjective. Fortunately, clustering algorithms can do it much faster.
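
To make the idea of clustering articles concrete, here is a minimal sketch in plain Python. It is not the system described in this text, whose algorithm is not specified here. It assumes a simple approach chosen purely for illustration: each article becomes a TF-IDF vector, and an article joins the first cluster whose first member is similar enough by cosine similarity. The function names, the example headlines and the threshold of 0.3 are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one L2-normalised sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed inverse document frequency, so rare terms weigh more.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering (illustrative assumption, see lead-in).

    Each document joins the first cluster whose first member is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of document indices.
    """
    vectors = tfidf_vectors(docs)
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical headlines: two events, each reported by two sources.
docs = [
    "Parliament votes on new climate law",
    "Climate law passes vote in parliament",
    "Ajax wins the football cup final",
    "Football cup final won by Ajax",
]
clusters = cluster(docs)
print(clusters)
```

A single pass like this handles a daily stream naturally, because new articles can be compared against existing clusters without reclustering everything. Comparing against every existing cluster does not scale to millions of articles as written, so a real system would need indexing or a different algorithm.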

We created a system that produces high-quality clusters in a short amount of time, scales with the size of the dataset, and visualises the results clearly in D3. The system is written in Python.

Currently, the system is being used to categorise articles on