Thursday, January 4, 2018

NLP and me!

Good morning and welcome to my first post on this blog.


To my first-time readers: my area of research is Statistical NLP (SNLP). This area deals mainly with developing statistical models to represent language phenomena and designing algorithms to process them. My research problem is to come up with a semantic representation of sentences with which we might be able to identify paraphrases in sentence pairs, or better still, score them based on their similarity in meaning. The problem is a much-discussed one in the research community; no wonder, since 2012 SemEval has run a semantic textual similarity task every year.

So we shall start with a brief intro to how meaning is represented in SNLP. In most applications, language text is handled at the word level, or as blocks of n contiguous words (known as n-grams). To represent these words we use word vectors, ranging from simple one-hot encodings to complex neural embeddings. As with the rest of the world, NLP has also been taken over by the deep learning wave, so much of the current work revolves around DL, which makes use of distributional semantic models (DSMs).
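Just to make the word-representation idea concrete, here is a toy sketch in Python/NumPy. The vocabulary and dimensions are made up, and the embedding matrix is random rather than learned; it only illustrates the difference between a one-hot encoding and a dense embedding lookup.

import numpy as np

vocab = ["dog", "chased", "the", "cat"]
word_to_id = {w: i for i, w in enumerate(vocab)}

# One-hot encoding: a sparse vector with a single 1, so every pair of
# distinct words is equally (dis)similar.
def one_hot(word):
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

# Dense embedding: a row of a small real-valued matrix (random here,
# learned from data in practice), so similarity between words is graded.
embedding_dim = 8
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), embedding_dim))

def embed(word):
    return embeddings[word_to_id[word]]

print(one_hot("dog"))  # [1. 0. 0. 0.]
print(embed("dog"))    # an 8-dimensional dense vector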

What is a DSM? DSMs are based on the distributional hypothesis: words that occur in similar contexts tend to have similar meanings. So you attempt to represent the meaning of a word in terms of its context. This context may be the documents in which it occurs, sentences, co-occurring words within a fixed-size window, part-of-speech tags, the grammatical roles it plays, etc. Earlier DSMs were count vectors. Thanks to advances in neural networks, researchers have come up with word-embedding vectors (WE), which tend to capture not just the meanings of words but also the relationships between them. The most widely used models are Mikolov's word2vec and Stanford's GloVe. They exhibit an interesting property: vec(king) - vec(man) + vec(woman) ≈ vec(queen). Words with similar meanings also cluster together in the semantic space.
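To make the distributional hypothesis concrete, here is a minimal count-based DSM in Python/NumPy. The toy corpus and window size are my own; real DSMs are built from huge corpora, but the idea is the same: represent each word by its co-occurrence counts and compare words by cosine similarity.

import numpy as np

corpus = [
    "the dog chased the cat",
    "the dog bit the cat",
    "the cat chased the mouse",
]
window = 2  # co-occurrence window size (on each side)

tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))

# Count how often each word pair co-occurs within the window.
for sent in tokens:
    for i, w in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[idx[w], idx[sent[j]]] += 1

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

# "dog" and "cat" occur in similar contexts, so their count vectors are close.
print(cosine(counts[idx["dog"]], counts[idx["cat"]]))
print(cosine(counts[idx["dog"]], counts[idx["chased"]]))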

How is this achieved? Given a word and its context, the neural network is trained to predict the context from the word (as in skip-gram word2vec) or to reconstruct global co-occurrence statistics (as in GloVe), and in doing so it learns a representation of the word itself. Word embeddings have become the de facto standard for word representation, which naturally led to the question of how a longer text, like a sentence or a document, could be embedded in the same way. For the past few decades we have been using the bag-of-words (BOW) model for information retrieval tasks. In the BOW model, documents are treated as collections of words, each word represented by a vector (a one-hot count vector in the classic setting, or an embedding). To represent documents and queries, we simply average the vectors of the words in them; the document whose vector is closest to the query vector is retrieved in response to the query. This has worked well for document clustering and classification tasks too.
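Here is a rough sketch of that averaging-and-retrieval idea, with random stand-in vectors in place of real word2vec/GloVe embeddings (the corpus and query are made up):

import numpy as np

# Stand-in word vectors; in a real system these would come from word2vec or GloVe.
rng = np.random.default_rng(1)
word_vectors = {w: rng.normal(size=50)
                for w in "the a dog cat chased mouse ran away fast".split()}

def doc_vector(text):
    # Average the vectors of the known words in the text (BOW-style).
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

docs = ["the dog chased the cat", "the mouse ran away fast"]
query = "a cat chased a dog"

# Retrieve the document whose averaged vector is closest to the query vector.
scores = [cosine(doc_vector(query), doc_vector(d)) for d in docs]
print(docs[int(np.argmax(scores))])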

Despite its huge success, there have always been arguments that the BOW model is insufficient for intelligent applications, since language texts are not just collections of words. They have a specific order, and the meaning they convey is reflected in their syntax as well. Quite obvious from the example: "Dog chased the cat" isn't the same as "Cat chased the dog". For the past decade, many researchers have attempted to compose the meaning of a sentence from its constituent words, following the "principle of compositionality". In traditional NLP, when word meanings were atomic symbols, meaning composition was done recursively along the syntax tree to get a representation of the whole text, but this doesn't carry over well to DSMs. Vector operations like addition, pointwise multiplication, convolution, etc. have also been tried in many research works. Unfortunately, none of them promised a notable improvement over the basic BOW model, except for some very specific applications, and not for sentences longer than a few words.
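A quick toy check of why word order gets lost with such commutative operations (the vectors here are random, purely for illustration): both addition and pointwise multiplication give the two sentences exactly the same vector.

import numpy as np

rng = np.random.default_rng(2)
vec = {w: rng.normal(size=4) for w in ["dog", "chased", "the", "cat"]}

def compose_sum(words):
    return np.sum([vec[w] for w in words], axis=0)

def compose_pointwise(words):
    out = np.ones(4)
    for w in words:
        out = out * vec[w]
    return out

s1 = "dog chased the cat".split()
s2 = "cat chased the dog".split()

# Both compositions are commutative, so word order is lost completely.
print(np.allclose(compose_sum(s1), compose_sum(s2)))              # True
print(np.allclose(compose_pointwise(s1), compose_pointwise(s2)))  # True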

In 2014, Richard Socher proposed in his PhD thesis on deep learning for NLP a recursive neural network model that makes use of not just the constituent words but also the parse structure of a sentence to compose a sentence representation, used for sentiment analysis of movie reviews. Socher also proposed a dynamic pooling classifier that uses his sentence representations to classify sentence pairs as paraphrases or not. This work sent ripples through the area and inspired similar research around the globe. But we are still far from the actual goal of a representation in which relative vector distance reflects the semantic similarity between a sentence pair, as is the case with word embeddings.
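This is not Socher's actual model, just a bare-bones sketch of the recursive idea, with made-up parse trees, dimensions and weights: each parent vector is computed from its children with a shared weight matrix and a nonlinearity, so the sentence vector depends on the tree structure, not merely on which words occur.

import numpy as np

dim = 4
rng = np.random.default_rng(3)
W = rng.normal(scale=0.1, size=(dim, 2 * dim))  # shared composition weights
b = np.zeros(dim)
vec = {w: rng.normal(size=dim) for w in ["dog", "chased", "the", "cat"]}

def compose(node):
    # A leaf is a word string; an internal node is a pair of subtrees.
    if isinstance(node, str):
        return vec[node]
    left, right = node
    children = np.concatenate([compose(left), compose(right)])
    return np.tanh(W @ children + b)  # parent = tanh(W [left; right] + b)

tree1 = ("dog", ("chased", ("the", "cat")))
tree2 = ("cat", ("chased", ("the", "dog")))
print(np.allclose(compose(tree1), compose(tree2)))  # False: structure and order matter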

Apart from the theoretical motivation, sentence embeddings have a wide range of applications, including paraphrase identification, question answering, summary generation, semantic search, etc. With word embeddings fast gaining popularity, simple BOW methods are no longer sufficient for the applications built on top of them.

Where do we stand? To quote Neil Lawrence, "NLP is kind of like a rabbit in the headlights of the Deep Learning machine, waiting to be flattened". Though one school of linguists still strongly believes that language is too complex to be represented by numbers, and that statistical models won't make progress beyond a few phrases, I hope that theoretical and statistical NLP will go hand in hand, complementing each other and making our world (and my research) much simpler :)