
Of course, right after writing this I remembered that I hadn't gone back to the paper the OCTIS people wrote, OCTIS: Comparing and Optimizing Topic Models is Simple! That has happened to me more times than I would like to admit! So anything you suggest that is not referenced there would be super.

The metrics that you find in the paper and in OCTIS are, at least in my experience, the most common metrics that you see in academia. Especially NPMI and Topic Diversity are frequently used as a proxy for the "quality" of these topic modeling techniques. Essentially, BERTopic is a clustering algorithm with a topic representation on top. The assumption here is that good clusters lead to good topic representations. Thus, in order to have a good model, you will need good clusters. One thing that might be interesting to look at is clustering metrics. You can find some of these metrics here, but be aware that some of them might need labels to judge the quality of the generated clusters.

To evaluate the coherence of learned topics, we calculate the Normalized Pointwise Mutual Information (NPMI):

```python
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired
from sklearn.feature_extraction.text import TfidfVectorizer
import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel

topic_model = BERTopic(verbose=True, n_gram_range=(1, 3))
topics, _ = topic_model.fit_transform(docs)

# Preprocess documents
cleaned_docs = topic_model._preprocess_text(docs)

# Extract vectorizer and tokenizer from BERTopic
vectorizer = topic_model.vectorizer_model
tokenizer = vectorizer.build_tokenizer()

# Extract features for Topic Coherence evaluation
words = vectorizer.get_feature_names_out()
tokens = [tokenizer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]
topic_words = [[word for word, _ in topic_model.get_topic(topic)]
               for topic in range(len(set(topics)) - 1)]

# Evaluate
coherence_model = CoherenceModel(topics=topic_words,
                                 texts=tokens,
                                 corpus=corpus,
                                 dictionary=dictionary,
                                 coherence='c_v')
coherence = coherence_model.get_coherence()
```

Hello MaartenGr, I tried to execute this, but the problem is the tokenizer: my BERTopic model got topics with n-grams from 1 to 10, while the tokenizer here got tokens with only one term (1-gram). When I consider n_gram_range=(1, 1) like this:

```python
from sentence_transformers import SentenceTransformer

topic_model = BERTopic(verbose=True, embedding_model=embedder,
                       n_gram_range=(1, 1), calculate_probabilities=True)
```

I do get the coherence values, which in this case were 0.1725 for c_v, -0.2662 for c_npmi, and -8.5744 for u_mass. The trend generated by the UMass metric shows improved topic coherence and also better clusters.
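One way to work around the mismatch between multi-word topic terms and a unigram-tokenized corpus is to split each topic term into unigrams before handing the topics to gensim's CoherenceModel, so topic words and corpus tokens share the same vocabulary. The sketch below is not from this thread; `unigramize_topics` is a hypothetical helper name I chose for illustration:

```python
def unigramize_topics(topic_words, top_n=10):
    """Split multi-word topic terms into unigrams, dropping duplicates.

    topic_words: list of topics, each a list of (possibly n-gram) terms.
    Returns unigram-only topics, each truncated to top_n words.
    """
    unigram_topics = []
    for topic in topic_words:
        seen, flat = set(), []
        for term in topic:
            # A bigram like "machine learning" becomes ["machine", "learning"]
            for word in term.split():
                if word not in seen:
                    seen.add(word)
                    flat.append(word)
        unigram_topics.append(flat[:top_n])
    return unigram_topics

print(unigramize_topics([["machine learning", "deep learning", "ai"]]))
# → [['machine', 'learning', 'deep', 'ai']]
```

The deduplication matters because gensim expects each topic's word list to contain distinct terms; overlapping n-grams like "machine learning" and "deep learning" would otherwise repeat "learning".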

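The Topic Diversity metric mentioned above is usually defined as the proportion of unique words among the top words of all topics, so it is simple to compute by hand. A minimal sketch (this is my own hand-rolled version, not the OCTIS implementation, and `topic_diversity` is my name for it):

```python
def topic_diversity(topic_words):
    """Fraction of unique words across the top words of all topics.

    1.0 means no word is shared between topics; values near 0 mean
    the topics are highly redundant.
    """
    all_words = [word for topic in topic_words for word in topic]
    return len(set(all_words)) / len(all_words)

topics = [["cat", "dog", "pet"], ["stock", "market", "pet"]]
print(topic_diversity(topics))  # 5 unique words out of 6 → 5/6
```

Unlike NPMI, this needs no reference corpus, which makes it a cheap complement to coherence scores.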