Averaging vs Partition Averaging

TL;DR: This post discusses my NLP research work on representing documents by weighted averaging of partitioned word vectors. (15-minute read)

tags: document representation, partition word vector averaging

Prologue

Let's consider a corpus C of N documents, with a vocabulary V of its most frequent words. Figure 1 depicts the word-vector space of V, in which words with similar meanings lie close to each other (a partition of the word vocabulary based on word-vector similarity). A few words are polysemic and belong to multiple topics, each with some proportion. In Figure 1 the topic number of a word is shown in subscript and the corresponding proportion in braces.

Figure 1: Corpus C word vocabulary partitioned using word2vec fuzzy clustering

Let's consider a text document d: "Data journalists deliver data science news to the general public. They often take part in interpreting the data models. Also, they cre
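To make the idea concrete, here is a minimal sketch of partition averaging: each word is softly assigned to topic clusters (standing in for the fuzzy-clustering proportions above), and the document vector concatenates a weighted average of word vectors per topic. The word vectors, centroids, and the softmax-based soft assignment are all hypothetical toy stand-ins, not the actual trained model.

```python
import numpy as np

# Toy 2-d word vectors (in practice these come from word2vec); hypothetical data.
word_vecs = {
    "data":    np.array([0.9, 0.1]),
    "science": np.array([0.8, 0.2]),
    "news":    np.array([0.1, 0.9]),
    "public":  np.array([0.2, 0.8]),
}

# Two topic centroids standing in for fuzzy-clustering output.
centroids = np.array([[1.0, 0.0],
                      [0.0, 1.0]])

def topic_proportions(vec, temp=0.1):
    """Soft assignment of a word to each topic (softmax over centroid similarities)."""
    sims = centroids @ vec
    e = np.exp(sims / temp)
    return e / e.sum()

def partition_doc_vector(words):
    """Concatenate the per-topic weighted averages of the document's word vectors."""
    k, dim = centroids.shape
    parts = np.zeros((k, dim))   # weighted sum of vectors per topic
    weights = np.zeros(k)        # total assignment mass per topic
    for w in words:
        v = word_vecs[w]
        p = topic_proportions(v)
        parts += p[:, None] * v
        weights += p
    parts /= np.maximum(weights, 1e-9)[:, None]  # weighted average per topic
    return parts.reshape(-1)

doc = ["data", "science", "news", "public"]
print(partition_doc_vector(doc).shape)  # (4,) — k topics x d dimensions
```

A plain average would collapse all four words into one 2-d vector; the partitioned version keeps the "data science" and "news/public" topics in separate slots, so documents mixing topics in different proportions remain distinguishable.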
With the current surge in attempts to emulate human "intelligence", the importance of Machine Learning cannot be overstated. Machine Learning is the science of creating algorithms that adapt to the features of data. In simpler terms, it involves algorithms that change themselves according to the data they are presented with. Take, for example, the familiar task of recommending products to users on an e-commerce site. A very popular approach is to study user data, find other users similar to you, and then recommend to you the products they have looked at. This is essentially identifying similar users, but the algorithm adapts itself to extract the "important" features in the user data. To achieve this, the algorithm must be able to identify what is important, extract hidden features (e.g., patterns in the sequence of page visits), and so on. This is where Machine Learning comes in. Machine Learning, today, is immensely capable of diving "deep" into hug
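The "find similar users, recommend what they viewed" approach described above can be sketched in a few lines of user-based collaborative filtering. The interaction matrix and user/item indices here are hypothetical toy data, assuming cosine similarity as the notion of "similar users".

```python
import numpy as np

# Toy user-item matrix (rows: users, cols: products); 1 = viewed. Hypothetical data.
interactions = np.array([
    [1, 1, 0, 0],   # user 0
    [1, 1, 1, 0],   # user 1 (similar viewing history to user 0)
    [0, 0, 1, 1],   # user 2
])

def cosine(u, v):
    """Cosine similarity between two interaction rows."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

def recommend(user, k=1):
    """Recommend items viewed by the k most similar other users but not by `user`."""
    sims = [(cosine(interactions[user], interactions[other]), other)
            for other in range(len(interactions)) if other != user]
    sims.sort(reverse=True)                  # most similar users first
    seen = interactions[user] > 0
    recs = []
    for _, other in sims[:k]:
        for item in np.flatnonzero(interactions[other]):
            if not seen[item] and int(item) not in recs:
                recs.append(int(item))
    return recs

print(recommend(0))  # [2] — viewed by similar user 1, not yet by user 0
```

Real recommenders replace the hand-crafted similarity with learned representations, which is exactly the "extracting hidden features" step the paragraph alludes to.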