LDA corpus stability

by marin   Last Updated September 19, 2018 14:19 PM

I am working on an LDA model to find topics in a corpus. Everything works fine, but when I add a few lines to my corpus, the topics change a lot. This should not happen, because the added lines correspond to only 0.1% of the total corpus... I'm just trying to understand what's happening...

import pandas
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

n_components = 5
n_top_words = 10

def print_top_words(model, feature_names, n_top_words):
    out_list = []
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic %d: " % topic_idx  # this is where the output format needs fixing
        message += " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])

        out_list.append(message.split())
    return out_list

text = pandas.read_csv('corpus.csv', encoding='utf-8')
get_text = text['phrases']
get_text_list = get_text.values.tolist()

# Use tf (raw term count) features for LDA.
tf_vectorizer = CountVectorizer()
tf = tf_vectorizer.fit_transform(get_text_list)
tf_feature_names = tf_vectorizer.get_feature_names()

lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0).fit(tf)

topics = print_top_words(lda, tf_feature_names, n_top_words)

print(topics)

