I am running LDA on health-related data. Specifically, I have ~500 documents, each containing an interview of roughly 5-7 pages. Beyond that, I cannot go into the details of the data, since I need to preserve confidentiality. Although I unfortunately cannot show the results output, I will describe the results and then walk through my procedure to give a better idea of what I am doing and where I can improve.
For the results, I chose 20 topics and output 10 words per topic. Although 20 was somewhat arbitrary and I did not have a clear sense of a good number of topics, it seemed reasonable given the size of the data and the fact that the documents are all health-specific. However, the results highlighted two issues: 1) it is unclear what the topics represent, since the words within each topic did not necessarily go together or tell a story, and 2) many words overlapped across topics, and a few words showed up in most topics.
In terms of what I did, I first preprocessed the text: I converted everything to lowercase, removed punctuation, and removed codings specific to this set of documents. I then tokenized the documents, lemmatized the words, and computed tf-idf features. I used sklearn's TfidfVectorizer and, when initializing it, specified a customized stopword list (which extended nltk's set of stopwords). I also set max_df to 0.9 (it is unclear what a good value is; I just played around with different values), min_df to 2, and max_features to 5000. I tried both tf-idf and bag of words (CountVectorizer), but found that tf-idf gave slightly clearer and more distinct topics when analyzing the LDA output. After this was done, I ran an LDA model with the number of topics set to 20 and the number of iterations set to 5.
From my understanding, each decision I made above may have affected the LDA model's ability to identify clear, meaningful topics. I know that text preprocessing plays a huge role in LDA performance, and the better job I do there, the more insightful the LDA will be. Is there anything glaringly wrong, or something I missed? Do you have any suggested values or explorations for the parameters I described above? How do I determine a good number of topics and iterations for the LDA step? And how do I validate performance, other than qualitatively comparing output?
I appreciate any advice or suggestions that you have! I am completely new to the area of topic modeling and while I have read some articles, I have a lot to learn! Thank you!
I think your preprocessing steps are fine; there isn't too much more you can do there. However, 5 iterations is far too few: whenever I use LDA, I give it at least 100 iterations, because LDA usually takes a long time to converge, even with the Variational Bayes algorithm. I would also suggest not reducing the number of dimensions to 5000; depending on the use case, it may or may not be a good idea. Try running LDA without any dimensionality reduction and see what you get. Be careful with the max_df and min_df parameters in sklearn's TfidfVectorizer: a float between 0 and 1 is interpreted as a fraction of documents, while an integer is interpreted as an absolute document count. Finally, keep track of how perplexity changes over time while training, and decide on the number of iterations accordingly.
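One way to track perplexity in sklearn is to run online passes with `partial_fit` and score a held-out split after each pass. This is only a sketch under my assumptions (plain counts rather than tf-idf, since LDA's generative model is defined over word counts; the corpus and split sizes are placeholders):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

def perplexity_curve(docs, n_topics=20, n_passes=100):
    """Held-out perplexity after each online pass; lower is better."""
    dtm = CountVectorizer(min_df=2).fit_transform(docs)
    train, heldout = train_test_split(dtm, test_size=0.2, random_state=0)
    lda = LatentDirichletAllocation(
        n_components=n_topics, learning_method="online", random_state=0
    )
    curve = []
    for _ in range(n_passes):
        lda.partial_fit(train)                 # one online pass over training docs
        curve.append(lda.perplexity(heldout))  # score on unseen documents
    return curve
```

Stop adding passes once the held-out perplexity plateaus; running the same loop for several values of `n_topics` and comparing the plateaus is one rough way to pick the number of topics as well.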