Brief Introduction


Latent Dirichlet Allocation (LDA) is a topic model that explains the similarity of data through unobserved groups. In our project, LDA divides all the words from the documents into different latent topics, which helps us better understand and get a general view of the ideas inside the comments or papers.

Model Specification


The entire set of our text data is called the corpus. The corpus contains M documents, and we decide the number of topics K before implementing the model. Every document m has \(N_{m}\) words, and the total number of distinct words is V, which is essentially the size of the vocabulary.
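For a concrete sense of these quantities, here is a minimal sketch (with three made-up toy documents) that builds a document-term count matrix with scikit-learn's CountVectorizer and reads off M, V and \(N_{m}\):

```python
# Toy illustration of M (documents), V (vocabulary size) and N_m (words per document).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the movie was great and the acting was great",
    "the plot was boring and slow",
    "great acting but a boring plot",
]  # M = 3 hypothetical documents

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)       # M x V document-term count matrix

M, V = X.shape                           # number of documents, vocabulary size
N = np.asarray(X.sum(axis=1)).ravel()    # N_m: number of counted words in each document
print(M, V, N)
```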


The distributions we use in our model are:

The Dirichlet distribution with d-dimensional parameter \(\overrightarrow{\alpha}=(\alpha_{1}, \alpha_{2}, \dots, \alpha_{d})\), \(d\geqslant2\), has probability density function (PDF) on the probability simplex in \(R^{d}\):

\(P(\overrightarrow{x}|\overrightarrow{\alpha})=\frac{\Gamma(\sum_{i=1}^{d}\alpha_{i})}{\prod_{i=1}^{d}\Gamma(\alpha_{i})}\prod_{i=1}^{d}x_{i}^{\alpha_{i}-1}\)

where \(\Gamma(\alpha_{i})\) is the Gamma function, \(\Gamma(\alpha_{i})=\int_0^\infty x^{\alpha_{i}-1}e^{-x}\,\mathrm{d}x\).

The Multinomial distribution with parameters \(n>0\), the integer number of trials, and d-dimensional parameter \(\overrightarrow{p}=(p_{1}, p_{2}, \dots, p_{d})\), the probability of each event, has probability mass function (PMF):

\(P(\overrightarrow{x}|n, \overrightarrow{p})=\frac{n!}{x_{1}!\dots x_{d}!}p_{1}^{x_{1}}\dots p_{d}^{x_{d}}\)
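Both densities can be evaluated directly with SciPy; the sketch below (with arbitrary example parameters) matches the two formulas above:

```python
# Evaluate the Dirichlet PDF and the Multinomial PMF (example parameters are arbitrary).
import numpy as np
from scipy.stats import dirichlet, multinomial

alpha = np.array([2.0, 3.0, 4.0])         # Dirichlet parameter, d = 3
x = np.array([0.2, 0.3, 0.5])             # a point on the probability simplex
print(dirichlet.pdf(x, alpha))            # Gamma(sum a_i) / prod Gamma(a_i) * prod x_i^(a_i - 1)

n = 10                                    # number of trials
p = np.array([0.2, 0.3, 0.5])             # event probabilities
counts = np.array([2, 3, 5])              # counts per event, summing to n
print(multinomial.pmf(counts, n=n, p=p))  # n! / (x_1! ... x_d!) * prod p_i^(x_i)
```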






Above is the general picture of the model (from Wikipedia).


Variables

M, the total number of documents in corpus;

K, the total number of topics inside the corpus;

V, the total number of words in the vocabulary;

\(N_{m}\), the total number of words in the m-th document;

\(\alpha\), a K-dimensional vector, is the parameter of the Dirichlet distribution that serves as the prior for the topic distribution of a document;

\(\beta\), a V-dimensional vector, is the parameter of the Dirichlet distribution that serves as the prior for the word distribution of a topic;

\(\theta_{m}\), a K-dimensional vector, is the parameter of the Multinomial distribution, i.e. the topic distribution of document m;

\(\varphi_{k}\), a V-dimensional vector, is the parameter of the Multinomial distribution, i.e. the word distribution of topic k;

\(z_{mn}\), a number \(k\in \{1,2,\dots, K\}\), is the specific topic for the n-th word in document m;

\(w_{mn}\) is the specific n-th word in document m.

Note that the only observed variables are the \(w_{mn}\), which make up the corpus.


Process
step 1
Generate \(\theta_{m}\sim Dir(\alpha)\) for each document m and \(\varphi_{k}\sim Dir(\beta)\) for each topic k.

As a result, we have two matrices:

an M by K matrix \(\Theta\) with \(\theta_{m}\) in its m-th row, whose entry \(\Theta_{mk}\) is the probability of the k-th topic in document m;

a K by V matrix \(\varPsi\) with \(\varphi_{k}\) in its k-th row, whose entry \(\varPsi_{kv}\) is the probability of the v-th word of the vocabulary in topic k.

step 2
For the n-th word of the m-th document, draw its topic from the Multinomial distribution with one trial and parameter \(\theta_{m}\), i.e. \(z_{mn}\sim Mult(1,\theta_{m})\).
As a result, we have M vectors of lengths \(N_{1}, N_{2}, \dots, N_{M}\); call this set Z.

step 3
Draw the n-th word of the m-th document from the Multinomial distribution with one trial and parameter \(\varphi_{z_{mn}}\), i.e. \(w_{mn}\sim Mult(1, \varphi_{z_{mn}})\).
As a result, we have M vectors of lengths \(N_{1}, N_{2}, \dots, N_{M}\); call this set W.
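Putting steps 1-3 together, the sketch below simulates a toy corpus from this generative process; the sizes M, K, V, the document lengths, and the hyperparameters are arbitrary illustrative choices.

```python
# Simulate the LDA generative process on a toy scale.
import numpy as np

rng = np.random.default_rng(0)
M, K, V = 5, 3, 20                    # documents, topics, vocabulary size (illustrative)
N = rng.integers(10, 20, size=M)      # N_m: number of words in each document
alpha = np.full(K, 0.5)               # Dirichlet prior for topic distributions
beta = np.full(V, 0.1)                # Dirichlet prior for word distributions

# step 1: topic mixtures Theta (M x K) and word distributions Psi (K x V)
Theta = rng.dirichlet(alpha, size=M)
Psi = rng.dirichlet(beta, size=K)

Z, W = [], []
for m in range(M):
    z_m = np.empty(N[m], dtype=int)
    w_m = np.empty(N[m], dtype=int)
    for n in range(N[m]):
        z_m[n] = rng.choice(K, p=Theta[m])       # step 2: z_mn ~ Mult(1, theta_m)
        w_m[n] = rng.choice(V, p=Psi[z_m[n]])    # step 3: w_mn ~ Mult(1, phi_{z_mn})
    Z.append(z_m)
    W.append(w_m)
```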

Inference


The joint probability density function of the model is:

\(P(W,Z,\Theta,\varPsi|\alpha,\beta)=\prod_{k=1}^{K}P(\varphi_{k}|\beta)\prod_{m=1}^{M}(P(\theta_{m}|\alpha)\prod_{n=1}^{N_m}P(z_{mn}|\theta_{m})P(w_{mn}|\varphi_{z_{mn}}))\)
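As a sanity check, this joint density can be evaluated in log form (to avoid underflow) on variables simulated as in the sketch above; the function below is only an illustration of the formula, not part of the inference itself.

```python
# Log of the joint density P(W, Z, Theta, Psi | alpha, beta).
# W and Z are lists of integer arrays, Theta is M x K, Psi is K x V,
# alpha and beta are the Dirichlet parameters (as in the simulation sketch above).
import numpy as np
from scipy.stats import dirichlet

def log_joint(W, Z, Theta, Psi, alpha, beta):
    lp = sum(dirichlet.logpdf(phi_k, beta) for phi_k in Psi)   # sum_k log P(phi_k | beta)
    for m, (w_m, z_m) in enumerate(zip(W, Z)):
        lp += dirichlet.logpdf(Theta[m], alpha)                # log P(theta_m | alpha)
        lp += np.sum(np.log(Theta[m][z_m]))                    # sum_n log P(z_mn | theta_m)
        lp += np.sum(np.log(Psi[z_m, w_m]))                    # sum_n log P(w_mn | phi_{z_mn})
    return lp

# e.g. log_joint(W, Z, Theta, Psi, alpha, beta) with the variables simulated above
```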


The posteriors we want to obtain are:

\(P(z_{mn}|w_{mn},\theta_{m}, \varPsi)\propto P(z_{mn}|\theta_{m})P(w_{mn}|\varphi_{z_{mn}})\)

\(P(\theta_{m}|W, Z, \alpha, \varPsi)\propto P(\theta_{m}|\alpha)\prod_{n=1}^{N_{m}}P(z_{mn}|\theta_{m})P(w_{mn}|\varphi_{z_{mn}})\)

\(P(\varphi_{k}|W, Z, \beta, \Theta)\propto P(\varphi_{k}|\beta)\prod_{m=1}^{M}\prod_{n=1}^{N_m}I(P(z_{mn}|\theta_{m})P(w_{mn}|\varphi_{z_{mn}});z_{mn}=k)\)

where
\[
I(P(z_{mn}|\theta_{m})P(w_{mn}|\varphi_{z_{mn}});z_{mn}=k)=
\begin{cases}
P(k|\theta_{m})P(w_{mn}|\varphi_{k}) & \text{if } z_{mn}=k, \\
1 & \text{if } z_{mn}\neq k.
\end{cases}
\]



By using Gibbs sampling or the EM algorithm, we can approximate these posteriors.
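For Gibbs sampling, one minimal (non-collapsed) sketch that cycles through exactly these three conditionals is given below; it assumes symmetric priors (scalar alpha and beta), takes the corpus W as a list of integer word-index arrays as in the simulation sketch above, and uses an arbitrary number of iterations.

```python
# A minimal non-collapsed Gibbs sampler cycling through the three conditional
# posteriors above. W: list of M integer arrays of word indices; K topics,
# V vocabulary size, scalar symmetric priors alpha and beta (assumptions).
import numpy as np

def gibbs_lda(W, K, V, alpha, beta, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    M = len(W)
    Z = [rng.integers(K, size=len(w_m)) for w_m in W]     # random initial topic assignments
    Theta = rng.dirichlet(np.full(K, alpha), size=M)
    Psi = rng.dirichlet(np.full(V, beta), size=K)

    for _ in range(iters):
        # P(z_mn | w_mn, theta_m, Psi) proportional to P(z_mn | theta_m) P(w_mn | phi_{z_mn})
        for m, w_m in enumerate(W):
            for n, w in enumerate(w_m):
                p = Theta[m] * Psi[:, w]
                Z[m][n] = rng.choice(K, p=p / p.sum())

        # P(theta_m | Z, alpha): Dirichlet with the topic counts of document m added
        for m in range(M):
            Theta[m] = rng.dirichlet(alpha + np.bincount(Z[m], minlength=K))

        # P(phi_k | W, Z, beta): Dirichlet with the counts of words assigned to topic k added
        for k in range(K):
            word_counts = np.zeros(V)
            for m, w_m in enumerate(W):
                np.add.at(word_counts, w_m[Z[m] == k], 1)
            Psi[k] = rng.dirichlet(beta + word_counts)

    return Theta, Psi, Z
```

Averaging the draws of Theta and Psi over the later iterations approximates the posterior distributions of interest.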

Finally, we can examine the posterior distribution of topics within each document and the posterior distribution of words within each topic.

Parameter Tuning

Choosing the right number of topics (clusters) is also an objective of this project. Theoretically, we would like to choose the number of topics that captures the most information, and perplexity is an ideal statistical measure for this.
Perplexity is a statistical measure of how well a probability model predicts a sample. As applied to LDA, for a given value of K we estimate the LDA model and compute the perplexity of held-out documents. The model with the lowest perplexity is generally considered the “best”. Let’s estimate a series of LDA models on the Associated Press dataset.
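A sketch of that procedure with scikit-learn is shown below; the toy documents stand in for the real corpus (the Associated Press data is not bundled with scikit-learn), and the grid of K values is an arbitrary example.

```python
# Fit LDA for several K and compare held-out (test) perplexity; lower is better.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

docs = [
    "stocks rose sharply after the earnings report",
    "the senate passed the budget bill on tuesday",
    "the team won the championship game last night",
    "oil prices fell amid weak global demand",
    "the president signed the trade agreement",
    "the striker scored twice in the final match",
]  # stand-in documents; replace with the real corpus

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
X_train, X_test = train_test_split(X, test_size=0.33, random_state=0)

for k in [2, 3, 6, 12]:
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X_train)
    print(k, lda.perplexity(X_train), lda.perplexity(X_test))
```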

(Plot: training and test perplexity against the number of topics K.)

From this plot, we can see that although the training perplexity keeps decreasing as K (the number of topics) increases, the test perplexity is minimized at K = 12. In unsupervised learning, perplexity is the criterion used to judge the performance of the model: it measures how well the model generalizes to a new data set.








From a statistical perspective, the choice of K is gauged by analyzing the test perplexity. In practice, however, we also try different values of K and inspect the key words that each topic contains. Ideally, we would like the top words of different topics to be distinct. In this project, we choose K = 3 for both the positive and the negative comments.
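To illustrate that practice, the sketch below (reusing `vectorizer` and `X` from the perplexity example above, with K = 3) prints the ten highest-weight words of each topic so their distinctness can be checked by eye.

```python
# Print the top ten words of each topic for a K = 3 model
# (reusing `vectorizer` and `X` from the perplexity sketch above).
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)
vocab = np.asarray(vectorizer.get_feature_names_out())
for k, topic in enumerate(lda.components_):      # unnormalized word weights for topic k
    top = vocab[np.argsort(topic)[::-1][:10]]
    print(f"topic {k}:", ", ".join(top))
```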