I'm working on review data where each review is labeled as positive or negative. My aim is to find topics in those reviews that are perceived as either positive or negative. To that end, I ran LDA on this data, where each input row is a single positive or negative review, i.e. my documents come in two types, "positive review" and "negative review". The output of the LDA gives me rather diverse words within each topic, which are not easy to interpret.
However, when I sum the assigned topic probabilities separately over the positive and the negative documents, some topics receive clearly more probability mass from one class than from the other. I take this as an indicator that the topics can separate positive from negative documents.
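Concretely, the mass comparison I describe looks like this (the document-topic matrix here is hypothetical, standing in for the output of `lda.transform`):

```python
import numpy as np

# Hypothetical document-topic matrix (4 docs x 3 topics), as produced by LDA.
doc_topic = np.array([
    [0.7, 0.2, 0.1],   # positive review
    [0.1, 0.1, 0.8],   # negative review
    [0.6, 0.3, 0.1],   # positive review
    [0.2, 0.1, 0.7],   # negative review
])
labels = np.array(["pos", "neg", "pos", "neg"])

# Sum the per-topic probability mass separately over each class.
pos_mass = doc_topic[labels == "pos"].sum(axis=0)
neg_mass = doc_topic[labels == "neg"].sum(axis=0)
print("positive mass per topic:", pos_mass)  # topic 0 dominated by positives
print("negative mass per topic:", neg_mass)  # topic 2 dominated by negatives
```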
In order to better interpret the output, I want to map the top 10 words of each topic to a real vector space with word2vec (w2v), separately for the positive and the negative documents. Then I want to run k-means clustering on the two sets of vectors. My hope is that I will get clean, interpretable clusters where I can take each centroid as a topic. The role of the LDA would then be dimensionality reduction: reducing each review to a few words before the clustering.
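A sketch of that second stage, with random vectors standing in for a trained word2vec model (so the resulting clusters here are not semantically meaningful; in practice you would look each word up in the w2v model, e.g. `vectors = np.array([w2v[w] for w in top_words])`):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical top words pooled from the LDA topics of one class.
top_words = ["great", "love", "perfect", "value",
             "terrible", "awful", "broke", "refund"]

# Stand-in embedding lookup: random 50-dim vectors instead of real w2v output.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(top_words), 50))

# Cluster the word vectors; each centroid is then read as a "topic".
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
for c in range(2):
    members = [w for w, lab in zip(top_words, km.labels_) if lab == c]
    print(f"cluster {c}: {members}")
```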
Does that approach make sense? What alternatives would you suggest?