When you open a popular online service, say YouTube, from your machine, a set of videos appears on its home page. Later, if you open the same service from another machine, it displays a different set of videos. Who is behind this? The simple answer is YouTube’s machine learning (ML) algorithm. The algorithm quietly observes your browsing habits, learns from them and determines the kind of videos you prefer to watch. Based on the algorithm’s recommendations, YouTube then displays a page that is (supposedly) tuned to your interests.
You may be wondering what this machine learning stuff is, and how solutions based on it differ from the conventional problem-solving approach. Let us explain with an example. All of us are familiar with the spam or junk emails that land in our mailboxes regularly. We enlist the services of an anti-spam tool to identify and filter out these junk emails. Conventional spam detection services monitor certain attributes of the incoming mail and, based on a set of rules, compute a spam score for it. If the spam score exceeds a threshold value, the tool marks the mail as spam. Here, the success of the solution depends on the power of its rule set (rules such as: if the mail comes from certain kinds of domains, IP addresses or email IDs, it could be spam; if it contains a certain set of words, it could be junk; and so on).
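The rule-based scoring described above can be sketched in a few lines of Python. This is only an illustration, not the logic of any real anti-spam product: the blacklisted domains, the spam words, the weights and the threshold are all made-up values chosen for the example.

```python
import re

# Hypothetical rule set: every domain, word, weight and threshold below
# is an assumption made for this illustration.
BLACKLISTED_DOMAINS = {"win-big.example", "cheap-deals.example"}
SPAM_WORDS = {"lottery", "winner", "free", "urgent"}
SPAM_THRESHOLD = 5.0


def spam_score(sender: str, body: str) -> float:
    """Add up a fixed penalty for every rule the mail triggers."""
    score = 0.0
    # Rule 1: the sender's domain is on the blacklist.
    domain = sender.rsplit("@", 1)[-1].lower()
    if domain in BLACKLISTED_DOMAINS:
        score += 4.0
    # Rule 2: each distinct spam word found in the body adds to the score.
    words = set(re.findall(r"[a-z0-9']+", body.lower()))
    score += 1.5 * len(words & SPAM_WORDS)
    return score


def is_spam(sender: str, body: str) -> bool:
    """Mark the mail as spam when its score exceeds the threshold."""
    return spam_score(sender, body) > SPAM_THRESHOLD


print(is_spam("promo@win-big.example", "You are a lottery winner!"))
print(is_spam("alice@example.org", "Lunch at noon?"))
# Same message from a domain not yet blacklisted, with one word respelt.
print(is_spam("promo@brand-new.example", "You are a lottery w1nner!"))
```

The last call hints at the weakness of this design: a spammer who switches to a fresh domain and respells one word drops below the threshold, and the mail gets through until somebody updates the lists by hand.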
To facilitate the smooth functioning of this anti-spam solution, one needs to create a database of blacklisted domains, email IDs and the like. Once the tool is up and running, it will certainly function well for a while. Before long, however, you will find holes in the system. The content of the blacklist database (domain names, email IDs, spam words, etc.) keeps changing because spammers keep getting smarter, and unless you monitor these parameters continuously and update them (along with the rule sets), the system will become obsolete. The problem with this type of solution lies in the static nature of its data and rule set. So, the conventional static, rule-based system simply cannot scale.
Now consider a different approach. Let us assume that we already have some emails marked as spam and non-spam, because users themselves identify the spam emails and move them to the spam box. One can learn the patterns of spam emails by analysing this data. Once the features of a typical junk mail are identified, the software can use this information to decide whether a new incoming mail is spam or not. If the new mail conforms to the pattern, the system marks it as spam. This is the crux of the machine learning approach. Its distinctive feature is the meticulous use of data that has already been archived.
In an ML-based anti-spam solution, the new email passes through a machine learning algorithm (a classifier), which checks it against a corpus of emails. Based on what it has learnt from this pre-existing corpus, the ML algorithm marks the new mail as spam or non-spam. The corpus is dynamic, as it is constantly updated with users’ input, and the algorithm adjusts itself to these changes. So the solution, being organic in nature, stays in sync with real-world developments. The defining characteristic of the ML approach is that it learns from the corpus of data; learning here means identifying the patterns in the corpus and using them to make future decisions. As it gets more data, the algorithm learns to do the task better. Contrast this with the static nature of the conventional algorithm. The conventional solution does not listen to its users, so it misses the opportunity to learn from their feedback.
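To make the idea of learning from a corpus concrete, here is a minimal sketch of one possible classifier: a word-count naive Bayes filter with add-one smoothing. The article does not prescribe any particular algorithm, so this choice, the class name `TinySpamFilter`, its methods and the sample mails are all assumptions made for illustration. Production filters use far larger corpora and richer features.

```python
import math
import re
from collections import Counter


class TinySpamFilter:
    """Illustrative word-count naive Bayes filter that keeps learning."""

    LABELS = ("spam", "non-spam")

    def __init__(self) -> None:
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.mail_counts = {label: 0 for label in self.LABELS}

    @staticmethod
    def _tokens(text: str) -> list[str]:
        return re.findall(r"[a-z']+", text.lower())

    def learn(self, text: str, label: str) -> None:
        """Add one labelled mail to the corpus (training data or feedback)."""
        self.word_counts[label].update(self._tokens(text))
        self.mail_counts[label] += 1

    def _log_score(self, text: str, label: str) -> float:
        vocabulary = set().union(*self.word_counts.values())
        total_words = sum(self.word_counts[label].values())
        total_mails = sum(self.mail_counts.values())
        # Smoothed share of mails carrying this label (the prior).
        score = math.log((self.mail_counts[label] + 1) / (total_mails + 2))
        # Smoothed frequency of each word among mails with this label.
        for word in self._tokens(text):
            count = self.word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary) + 1))
        return score

    def classify(self, text: str) -> str:
        spam = self._log_score(text, "spam")
        non_spam = self._log_score(text, "non-spam")
        return "spam" if spam > non_spam else "non-spam"


spam_filter = TinySpamFilter()
spam_filter.learn("win a free lottery prize now", "spam")
spam_filter.learn("claim your free prize today", "spam")
spam_filter.learn("meeting agenda for monday", "non-spam")
spam_filter.learn("lunch with the project team", "non-spam")

# A new kind of junk mail the filter has never seen slips through.
before = spam_filter.classify("cheap pills shipped discreetly")

# A user moves that mail to the spam box, so the corpus grows.
spam_filter.learn("cheap pills shipped discreetly to your door", "spam")
after = spam_filter.classify("cheap pills available discreetly")

print(before, "->", after)
```

In this toy run the filter first lets the unfamiliar mail through as non-spam, then flags a similar mail as spam once a single piece of user feedback has been added to the corpus. Nobody edited a rule; the change in behaviour came from the data alone.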
Of late, ML has gained immense attention thanks to the widespread availability of data. Many of our activities now take place in the virtual world, where even our smallest movements can be tracked. When you read an article, view a video, purchase merchandise or make a payment online, an ML algorithm lurking somewhere on the Net is learning more about you. Like the all-knowing God, it probably understands you better than you understand yourself. Our credit card transactions, call records, bank transactions, social media postings, travel details and the like are all being recorded. The digital trails we leave behind as we navigate the e-world enhance the all-knowing power of the algorithm.
ML-based algorithms are all set to take over every aspect of our lives; they are spreading virally across all domains. According to a story that recently appeared in an online journal, “companies such as Facebook, GE, IBM, Hilton Worldwide, SAP and many others have been slowly adding data analytics into their recruitment practices”. As a job hunter, you may have all the qualities that count in the current job market. You may be well-versed in the subject, have enough experience and good communication skills and, to cap it all, have good interpersonal skills as well. With a well-crafted resume highlighting these aspects, you should be able to impress the powers that be and grab the job. Unfortunately, in the coming days, this strategy alone may not land you gainful employment. Even to get an opportunity to meet the prospective employer, your first hurdle will be to impress the machine learning algorithm that controls the recruitment process.
Yet another field worth a mention in this regard is the publishing industry. Digital reading helps publishers understand a great deal about your reading habits: the books and novels you read quickly, the books you read only partially and so on. They use this information to figure out what a reader really wants and extrapolate from the data to determine the kind of books you might order next. Though tens of thousands of new works of fiction are published every year, only a minute percentage of them gain the “bestseller” tag. What makes them tick? Jodie Archer and Matthew Jockers, authors of the recently published book “The Bestseller Code”, claim they have cracked the secret code behind the success of bestselling novels. Analysing a huge corpus of contemporary novels, they developed an algorithm called the “bestseller-ometer”, which they say can predict whether a novel will be a bestseller or not.