Mirror Question Pair Detection
Mirror question pair classification
Problem statement
This is an online competition hosted by a Chinese company. The task is to predict whether a given pair of questions share the same meaning semantically. For privacy reasons, all original questions are encoded as sequences of char IDs and word IDs. A char may be a single Chinese character, a single English letter, a punctuation mark, or a space; a word may be a Chinese or English word, a punctuation mark, or a space.
Dataset description
- char_embed.txt: 300-dimensional char embedding vectors trained with Google word2vec.
- word_embed.txt: 300-dimensional word embedding vectors trained with Google word2vec.
- question.csv: all questions in the train set and test set; each question is given as a sequence of char IDs and a sequence of word IDs.
- train.csv: question pairs in the train set.
- test.csv: question pairs in the test set.
Solution
I tried both traditional methods and end-to-end deep learning models. The results show that the deep learning models outperform the traditional ones, which in addition require a lot of hand-crafted features.
Traditional method
Feature engineering
The main idea is to obtain an embedding vector for each sentence. There are two main ways to do this: one is to synthesize the sentence embedding from word/char embeddings, and the other is to obtain the sentence embedding directly.
Get the sentence embedding directly:
- words_3gram_sentence_tfidf_embedding
- chars_5gram_sentence_tfidf_embedding
- words_3gram_tfidf_SVD_sentence_embed
- chars_5gram_tfidf_SVD_sentence_embed
- words_1gram_tfidf_NMF_sentence_embed (Non-Negative Matrix Factorization)
- chars_1gram_tfidf_NMF_sentence_embed
- words_1gram_tf_LDA_sentence_embed (LatentDirichletAllocation)
- chars_1gram_tf_LDA_sentence_embed
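A minimal scikit-learn sketch of how such direct sentence embeddings can be computed; the n-gram ranges and component counts here are illustrative assumptions, not the exact settings used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, NMF, LatentDirichletAllocation

# questions are space-separated word-ID (or char-ID) strings, e.g. "W101 W2 W53"
def direct_sentence_embeddings(texts, ngram_range=(1, 3), n_components=100):
    # sparse tf-idf sentence representation (e.g. words_3gram_sentence_tfidf_embedding)
    tfidf = TfidfVectorizer(ngram_range=ngram_range)
    X_tfidf = tfidf.fit_transform(texts)

    # dense sentence embeddings obtained by decomposing the tf-idf matrix
    svd_embed = TruncatedSVD(n_components=n_components).fit_transform(X_tfidf)
    nmf_embed = NMF(n_components=n_components).fit_transform(X_tfidf)

    # LDA works on raw term counts (tf), not tf-idf
    tf = CountVectorizer(ngram_range=(1, 1))
    X_tf = tf.fit_transform(texts)
    lda_embed = LatentDirichletAllocation(n_components=n_components).fit_transform(X_tf)

    return X_tfidf, svd_embed, nmf_embed, lda_embed
```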
Synthesize the sentence embedding from word/char embeddings.
Methods to obtain the word/char embeddings:
- word2vec word/char embeddings supplied by the host
- GloVe word/char embeddings trained by ourselves
- decomposition of the tf-idf matrix using SVD
We tried three different weighting strategies.
- equally
- tf-idf linear weight
- tf-idf exp weight
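A minimal numpy sketch of the three weighting strategies; the exact form of the exponential weighting is an assumption.

```python
import numpy as np

def sentence_embedding(token_ids, embed, idf, mode="equally"):
    """Synthesize a sentence embedding from token embeddings.

    token_ids: list of word/char IDs in the sentence
    embed:     dict mapping token ID -> 300-d embedding vector
    idf:       dict mapping token ID -> idf value
    """
    vecs = np.array([embed[t] for t in token_ids])            # (len, 300)
    if mode == "equally":
        weights = np.ones(len(token_ids))
    elif mode == "tf_idf_linear":
        # tf is implicit: a repeated token contributes one weighted vector per occurrence
        weights = np.array([idf[t] for t in token_ids])
    elif mode == "tf_idf_exp":
        weights = np.exp(np.array([idf[t] for t in token_ids]))
    weights = weights / weights.sum()
    return weights @ vecs                                      # weighted average, (300,)
```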
In total, the sentence embeddings we obtain by synthesis are:
- word2vec_word_embed_to_sentence_embed_mode_equally
- word2vec_word_embed_to_sentence_embed_mode_tf_idf_exp
- word2vec_word_embed_to_sentence_embed_mode_tf_idf_linear
- word2vec_char_embed_to_sentence_embed_mode_equally
- word2vec_char_embed_to_sentence_embed_mode_tf_idf_exp
- word2vec_char_embed_to_sentence_embed_mode_tf_idf_linear
- glove_char_embed_to_sentence_embed_mode_equally
- glove_char_embed_to_sentence_embed_mode_tf_idf_exp
- glove_char_embed_to_sentence_embed_mode_tf_idf_linear
- glove_word_embed_to_sentence_embed_mode_equally
- glove_word_embed_to_sentence_embed_mode_tf_idf_exp
- glove_word_embed_to_sentence_embed_mode_tf_idf_linear
- words_3gram_tfidf_SVD_word_embed_to_sentence_embed_mode_equally
- words_3gram_tfidf_SVD_word_embed_to_sentence_embed_mode_tf_idf_exp
- words_3gram_tfidf_SVD_word_embed_to_sentence_embed_mode_tf_idf_linear
- chars_5gram_tfidf_SVD_char_embed_to_sentence_embed_mode_equally
- chars_5gram_tfidf_SVD_char_embed_to_sentence_embed_mode_tf_idf_exp
- chars_5gram_tfidf_SVD_char_embed_to_sentence_embed_mode_tf_idf_linear
After getting the sentence embeddings, we can form features from each embedding pair. One way is to compute a distance or kernel value between the two embeddings; the other is to keep the element-wise absolute difference of the two embedding vectors as features.
The distances we tried for the tf-idf sentence embeddings:
- cosine_similarity: $$cos(x,y) = \frac{x^Ty}{||x||_2||y||_2}$$
- polynomial_kernel: computes the degree-$d$ polynomial kernel between two vectors, which represents their similarity. Conceptually, the polynomial kernel considers not only the similarity between vectors along the same dimension, but also across dimensions; when used in machine learning algorithms, this accounts for feature interactions. It is defined as $K(x,y) = (\gamma x^Ty+c_0)^d$, where $\gamma$ defaults to $\frac{1}{|x|}$ ($|x|$ is the number of elements in $x$), $c_0$ defaults to $1$, and $d$ defaults to $3$.
- sigmoid_kernel: computes the sigmoid kernel between two vectors. The sigmoid kernel is also known as the hyperbolic tangent, or Multilayer Perceptron kernel (in the neural network field it is often used as a neuron activation function). It is defined as $K(x,y)=tanh(\gamma x^Ty+c_0)$, where $\gamma$ defaults to $\frac{1}{|x|}$ and $c_0$ defaults to $1$.
- rbf_kernel: computes the radial basis function (RBF) kernel between two vectors, defined as $K(x,y) = exp(-\gamma ||x-y||_2^2)$, where $\gamma$ defaults to $\frac{1}{|x|}$.
- laplacian_kernel: a variant of the RBF kernel, defined as $K(x,y) = exp(-\gamma ||x-y||_1)$, where $||x-y||_1$ is the Manhattan distance between the input vectors and $\gamma$ defaults to $\frac{1}{|x|}$.
- my_chi2_kernel: the chi-squared kernel, given by $K(x,y)=exp(-\gamma \sum_{i}\frac{(x[i]-y[i])^2}{x[i]+y[i]})$, where $\gamma$ defaults to $1$. The data is assumed to be non-negative, so this kernel is only applied to the tf-idf sentence embeddings.
- euclidean: $K(x,y)=||x-y||_2$
- cityblock: $K(x,y)=||x-y||_1$
- WMD (Word Mover's Distance): assume sentence 1 is represented as a bag of words $b_1= \lbrace w_1,w_2,\dots, w_n \rbrace$ and sentence 2 as a bag of words $b_2=\lbrace \bar w_1, \bar w_2, \dots, \bar w_m \rbrace$. The distance is defined as $K(b_1,b_2)=\sum_{i=1}^{n}\min_{j}D(w_i,\bar{w}_j)$, where $D$ is a function measuring the distance between two word embeddings, typically the Euclidean distance.
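As a sketch, the kernel formulas above can be evaluated per question pair directly with numpy and scikit-learn's paired distance helpers; the default $\gamma = 1/|x|$ follows the definitions above, and the function and argument names are my own.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances,
)

def pair_kernel_features(E1, E2, gamma=None, c0=1.0, d=3):
    """Distance/kernel features for aligned embedding pairs.

    E1, E2: (n_pairs, dim) sentence-embedding matrices for question 1 and question 2.
    """
    n_pairs, dim = E1.shape
    if gamma is None:
        gamma = 1.0 / dim                        # the 1/|x| default from the definitions
    dot = np.sum(E1 * E2, axis=1)                # x^T y for each pair
    l1 = paired_manhattan_distances(E1, E2)      # ||x - y||_1
    l2 = paired_euclidean_distances(E1, E2)      # ||x - y||_2
    return {
        "cosine_similarity": 1.0 - paired_cosine_distances(E1, E2),
        "polynomial_kernel": (gamma * dot + c0) ** d,
        "sigmoid_kernel": np.tanh(gamma * dot + c0),
        "rbf_kernel": np.exp(-gamma * l2 ** 2),
        "laplacian_kernel": np.exp(-gamma * l1),
        "euclidean": l2,
        "cityblock": l1,
    }
```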
The distances we tried for the non-tf-idf sentence embeddings:
- cosine_similarity
- linear_kernel, defined as $K(x,y)=x^Ty$
- polynomial_kernel
- sigmoid_kernel
- rbf_kernel
- laplacian_kernel
- my_chi2_kernel
- euclidean
- cityblock
The distances we tried for the tf-idf sentence embeddings (elements are non-negative):
- cosine_similarity
- polynomial_kernel
- sigmoid_kernel
- rbf_kernel
- laplacian_kernel
- euclidean
- cityblock
The word/char embeddings we adopted to calculate the WMD distance:
- words_3gram_tfidf_SVD_word_embed
- word2vec_word_embed
- glove_word_embed.csv
- chars_5gram_tfidf_SVD_char_embed.csv
- word2vec_char_embed.csv
- glove_char_embed.csv
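A minimal scipy sketch of the simplified WMD defined above, written directly from the formula (this is the relaxed variant, not the full transport problem).

```python
import numpy as np
from scipy.spatial.distance import cdist

def relaxed_wmd(tokens1, tokens2, embed):
    """Simplified Word Mover's Distance as defined above:
    sum over words in sentence 1 of the distance to the closest word in sentence 2.

    tokens1, tokens2: lists of word/char IDs
    embed:            dict mapping token ID -> embedding vector
    """
    V1 = np.array([embed[t] for t in tokens1])
    V2 = np.array([embed[t] for t in tokens2])
    D = cdist(V1, V2, metric="euclidean")   # D[i, j] = ||w_i - w_bar_j||_2
    return D.min(axis=1).sum()
```

Since the formula is asymmetric, one option is to use `relaxed_wmd(s1, s2, embed) + relaxed_wmd(s2, s1, embed)` as the feature.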
We also keep the element-wise absolute difference of the two embeddings as features; these features preserve the original information in the embeddings. The embeddings used:
- words_3gram_tfidf_SVD_sentence_embed
- chars_5gram_tfidf_SVD_sentence_embed
- glove_char_embed_to_sentence_embed_mode_tf_idf_linear
- glove_word_embed_to_sentence_embed_mode_tf_idf_linear
- words_1gram_tf_LDA_sentence_embed
- chars_1gram_tf_LDA_sentence_embed
- chars_1gram_tfidf_NMF_sentence_embed
- words_1gram_tfidf_NMF_sentence_embed
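The absolute-difference features themselves are a one-liner (sketch):

```python
import numpy as np

def abs_diff_features(E1, E2):
    # E1, E2: (n_pairs, dim) sentence-embedding matrices for the two questions
    return np.abs(E1 - E2)      # each column is kept as a raw feature
```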
We also calculate some meta features, such as:
- F.spair_len_s1
- F.spair_len_s2
- F.spair_len_dif_abs
- F.spair_len_dif_over_max
- F.spair_len_dif_over_min
- F.spair_len_dif_over_mean
- F.levenshtein
- F.if_starts_same
- F.if_ends_same
- F.num_commom_n_gram(i) for i in range(1,n+1)
- F.jaccard_till_n_gram(n)
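A sketch of how such meta features can be computed; the function and feature names below are my own and only approximate what the F.* names above suggest (F.levenshtein would additionally need an edit-distance implementation such as python-Levenshtein).

```python
def meta_features(s1, s2, n=3):
    """Pair-level meta features; s1 and s2 are lists of token IDs."""
    len1, len2 = len(s1), len(s2)
    feats = {
        "len_s1": len1,
        "len_s2": len2,
        "len_dif_abs": abs(len1 - len2),
        "len_dif_over_max": abs(len1 - len2) / max(len1, len2),
        "len_dif_over_min": abs(len1 - len2) / min(len1, len2),
        "len_dif_over_mean": abs(len1 - len2) / ((len1 + len2) / 2),
        "if_starts_same": int(s1[0] == s2[0]),
        "if_ends_same": int(s1[-1] == s2[-1]),
    }

    # shared n-gram counts and Jaccard similarity over all n-grams up to n
    def ngrams(seq, k):
        return {tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)}

    union, inter = set(), set()
    for k in range(1, n + 1):
        g1, g2 = ngrams(s1, k), ngrams(s2, k)
        feats[f"num_common_{k}gram"] = len(g1 & g2)
        union |= g1 | g2
        inter |= g1 & g2
    feats[f"jaccard_till_{n}gram"] = len(inter) / max(len(union), 1)
    return feats
```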
These are all the features we used in the traditional approach. The model we trained on them is LightGBM.
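A minimal LightGBM training sketch on such a feature matrix; the parameters are illustrative, not the tuned values.

```python
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# X: (n_pairs, n_features) matrix of all features above; y: 0/1 same-meaning labels
def train_lgb(X, y):
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.1, random_state=42)
    params = {
        "objective": "binary",
        "metric": "binary_logloss",
        "learning_rate": 0.05,
        "num_leaves": 63,
        "feature_fraction": 0.8,
        "bagging_fraction": 0.8,
    }
    train_set = lgb.Dataset(X_tr, label=y_tr)
    valid_set = lgb.Dataset(X_va, label=y_va, reference=train_set)
    model = lgb.train(
        params,
        train_set,
        num_boost_round=5000,
        valid_sets=[valid_set],
        callbacks=[lgb.early_stopping(100)],   # stop when validation logloss stalls
    )
    return model
```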
Deep learning method
We adopt the so-called siamese structure as our deep learning model. Generally speaking, a siamese network consists of a pair of sub networks that share the same parameters. For a pair of questions, we pass one question through one sub net and the other question through the other sub net; the outputs of the two sub nets are then concatenated and fed into a fully connected layer. Finally, a sigmoid output gives the probability that the two questions have the same meaning.
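A minimal PyTorch sketch of this siamese wrapper around a shared encoder; layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SiameseClassifier(nn.Module):
    def __init__(self, encoder, encode_dim, hidden_dim=128):
        super().__init__()
        self.encoder = encoder                      # shared sub network (CNN / RNN / ...)
        self.fc = nn.Sequential(
            nn.Linear(2 * encode_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, q1, q2):
        v1 = self.encoder(q1)                       # same parameters for both questions
        v2 = self.encoder(q2)
        logit = self.fc(torch.cat([v1, v2], dim=-1))
        return torch.sigmoid(logit).squeeze(-1)     # probability of "same meaning"
```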
CNN
The sub network is a CNN. A 1D CNN for text looks like:

Max pooling along the length dimension ensures that questions of different lengths produce fixed-size outputs.
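A minimal PyTorch sketch of such a 1D-CNN sub network (not the original code; the filter sizes and dimensions are assumptions):

```python
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """1D-CNN sub network: embed -> parallel convolutions -> max pool over length."""
    def __init__(self, vocab_size, embed_dim=300, n_filters=128, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, n_filters, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, ids):                          # ids: (batch, seq_len)
        x = self.embed(ids).transpose(1, 2)          # (batch, embed_dim, seq_len)
        # max pool over the length dimension so the output size is length-independent
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)              # (batch, n_filters * len(kernel_sizes))
```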
RNN
The sub network is a two-layer bidirectional LSTM. A one-layer bidirectional RNN looks like:

I use the pack_padded_sequence function from PyTorch to handle variable-length inputs.
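A minimal PyTorch sketch of the encoder using pack_padded_sequence (not the original code; hidden sizes are assumptions):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class RNNEncoder(nn.Module):
    """Two-layer bidirectional LSTM sub network over the embedded token sequence."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                            bidirectional=True, batch_first=True)

    def forward(self, ids, lengths):        # ids: (batch, seq_len); lengths: (batch,) tensor
        x = self.embed(ids)
        # pack so the LSTM skips the padded positions
        packed = pack_padded_sequence(x, lengths.cpu(), batch_first=True,
                                      enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)     # h_n: (num_layers * 2, batch, hidden_dim)
        # concatenate the final forward and backward hidden states of the top layer
        return torch.cat([h_n[-2], h_n[-1]], dim=1)   # (batch, 2 * hidden_dim)
```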
CNN+RNN
We add a CNN layer before the RNN layer; the CNN layer uses "same" padding, so the sequence length is preserved.
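A minimal sketch of stacking a same-padding Conv1d in front of the bidirectional LSTM (dimensions are assumptions):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class CNNRNNEncoder(nn.Module):
    """Same-padding Conv1d followed by a two-layer bidirectional LSTM."""
    def __init__(self, vocab_size, embed_dim=300, n_filters=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # kernel_size=3 with padding=1 keeps the sequence length ("same" padding)
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(n_filters, hidden_dim, num_layers=2,
                            bidirectional=True, batch_first=True)

    def forward(self, ids, lengths):
        x = self.embed(ids).transpose(1, 2)           # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x)).transpose(1, 2)  # (batch, seq_len, n_filters)
        packed = pack_padded_sequence(x, lengths.cpu(), batch_first=True,
                                      enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)
        return torch.cat([h_n[-2], h_n[-1]], dim=1)
```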
BIMPM
The previous models do not include any interaction between the two sub networks. BiMPM introduces such interaction between the two networks. The reference paper is Bilateral Multi-Perspective Matching for Natural Language Sentences. The general architecture looks like:

The key part is the matching layer. The goal of this layer is to compare each contextual embedding (time-step) of one sentence against all contextual embeddings (time-steps) of the other sentence. First, we define a multi-perspective cosine matching function $f_m$ to compare two vectors: $$m=f_m(v_1,v_2;W)$$ where $v_1$ and $v_2$ are two $d$-dimensional vectors, $W$ is a trainable parameter of shape $l \times d$, $l$ is the number of perspectives, and the returned value $m$ is an $l$-dimensional vector with $m_k=cosine(v_1 \odot W_k, v_2 \odot W_k)$. Here $\odot$ is element-wise multiplication and $W_k$ is the $k$-th row of $W$. Four kinds of matching are defined here:

- Full-Matching: $$\overrightarrow{m_i}^{full} = f_m(\overrightarrow{h_i}^p,\overrightarrow{h_N}^q;W^1)$$ $$\overleftarrow{m_i}^{full} = f_m(\overleftarrow{h_i}^p,\overleftarrow{h_1}^q;W^2)$$
- Maxpooling-Matching: $$\overrightarrow{m_i}^{max} = \max_{j\in(1 \dots N)}f_m(\overrightarrow{h_i}^p,\overrightarrow{h_j}^q;W^3)$$ $$\overleftarrow{m_i}^{max} = \max_{j\in(1 \dots N)}f_m(\overleftarrow{h_i}^p,\overleftarrow{h_j}^q;W^4)$$
- Attentive-Matching: $$\overrightarrow{a_{i,j}} = cosine(\overrightarrow{h_i}^p,\overrightarrow{h_j}^q), \quad j=1,2,\dots,N$$ $$\overleftarrow{a_{i,j}} = cosine(\overleftarrow{h_i}^p,\overleftarrow{h_j}^q), \quad j=1,2,\dots,N$$ $$\overrightarrow{h_i}^{mean} = \frac{\sum_{j=1}^N \overrightarrow{a_{i,j}}\cdot\overrightarrow{h_j}^q}{\sum_{j=1}^N \overrightarrow{a_{i,j}}}$$ $$\overleftarrow{h_i}^{mean} = \frac{\sum_{j=1}^N \overleftarrow{a_{i,j}}\cdot\overleftarrow{h_j}^q}{\sum_{j=1}^N \overleftarrow{a_{i,j}}}$$ $$\overrightarrow{m_i}^{att} = f_m(\overrightarrow{h_i}^p,\overrightarrow{h_i}^{mean};W^5)$$ $$\overleftarrow{m_i}^{att} = f_m(\overleftarrow{h_i}^p,\overleftarrow{h_i}^{mean};W^6)$$
- Max-Attentive-Matching: similar to Attentive-Matching, but instead of taking the weighted sum of all the contextual embeddings as the attentive vector, we pick the contextual embedding with the highest cosine similarity as the attentive vector.
We apply all four matching strategies to each time-step of sentence P and concatenate the eight generated vectors as the matching vector for that time-step. We perform the same process for the reverse matching direction.
The core of BiMPM is the multi-perspective matching function.
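Below is a minimal PyTorch sketch of $f_m$ and of the Full-Matching strategy; the number of perspectives ($l=20$) and the initialization are illustrative assumptions, not the original settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def multi_perspective_match(v1, v2, W, eps=1e-8):
    """f_m(v1, v2; W): multi-perspective cosine matching.

    v1, v2: (batch, seq_len, d) contextual embeddings
    W:      (l, d) trainable perspective weights
    returns (batch, seq_len, l), entry k is cosine(v1 * W_k, v2 * W_k)
    """
    # broadcast each perspective's element-wise re-weighting
    v1 = v1.unsqueeze(2) * W                       # (batch, seq_len, l, d)
    v2 = v2.unsqueeze(2) * W
    return F.cosine_similarity(v1, v2, dim=-1, eps=eps)

class FullMatching(nn.Module):
    """Full-Matching: compare every time-step of P with the last hidden state of Q."""
    def __init__(self, d, l=20):
        super().__init__()
        self.W = nn.Parameter(torch.randn(l, d) * 0.1)

    def forward(self, h_p, h_q_last):
        # h_p: (batch, len_p, d); h_q_last: (batch, d) -> repeated along P's time-steps
        h_q = h_q_last.unsqueeze(1).expand_as(h_p)
        return multi_perspective_match(h_p, h_q, self.W)   # (batch, len_p, l)
```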
Ensemble
Finally, we ensemble the individual models to achieve the best score.