Hi,
Thanks for sharing the code. I'd like to raise a new issue here. When I use the dataset you shared, I get almost the same coherence score. However, when I use the original 20 Newsgroups dataset and follow the same preparation described in the paper, I get a very different result.
I would really appreciate it if you could share your thoughts. I can share the preparation script I used.
The coherence score is 0.09. Is it fair to say that the model is dependent on the data?
I cannot come up with a justification for that, as I used the same preparation (tokenize, remove stop words, keep the 2000 most frequent words, and create the vectors based on word frequency).
The reason I needed to prepare my own data is that the data you shared does not have the labels associated with the files. I need to generate the topics and have the associated labels so I can pass them to a classifier and test the accuracy of the model.
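The preparation steps described above (tokenize, remove stop words, keep the 2000 most frequent words, build frequency vectors, and keep each file's label) can be sketched roughly as follows. This is only a minimal illustration, not the actual script: the tiny stop-word list, the regex tokenizer, and the function names `tokenize`, `build_vocab`, `vectorize`, and `prepare` are all assumptions, and the real pipeline may break ties or filter tokens differently.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the real list may differ).
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens only, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_vocab(token_lists, vocab_size=2000):
    """Map the vocab_size most frequent words to column indices."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: i for i, (word, _) in enumerate(ranked[:vocab_size])}


def vectorize(tokens, vocab):
    """Turn one tokenized document into a raw term-count vector."""
    vec = [0] * len(vocab)
    for t in tokens:
        idx = vocab.get(t)
        if idx is not None:  # words outside the vocabulary are ignored
            vec[idx] += 1
    return vec


def prepare(docs, labels, vocab_size=2000):
    """Return count vectors, the labels in the same order, and the vocabulary."""
    token_lists = [tokenize(d) for d in docs]
    vocab = build_vocab(token_lists, vocab_size)
    X = [vectorize(tokens, vocab) for tokens in token_lists]
    return X, list(labels), vocab


# Toy usage with made-up documents; labels follow 20 Newsgroups category names.
docs = [
    "The engine of the car is loud",
    "The car engine and the car wheels",
    "God and faith",
]
labels = ["rec.autos", "rec.autos", "soc.religion.christian"]
X, y, vocab = prepare(docs, labels, vocab_size=3)
```

Because row `i` of `X` and entry `i` of `y` come from the same file, the labels stay aligned with the vectors and can be passed to a classifier later. Differences in any of these steps (stop-word list, tokenization, tie-breaking at the vocabulary cutoff) could change the resulting vocabulary and therefore the coherence score.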
I would appreciate it if you could help with this.
I preprocessed the data as described here: https://github.com/hugochan/KATE/blob/master/construct_20news.py (without stemming or normalizing the data).
Alternatively, if you could share the script you used to preprocess the data, that would also help.