Different preparation, very different result! #18

Hi,

Thanks for sharing the code. I'd like to raise a new issue here. When I use the dataset you shared, I get almost the same coherence score; however, when I use the original 20 Newsgroups dataset and follow the preparation described in the paper, I get a very different result.

I would really appreciate it if you could share your thoughts. I can share the preparation script I used.

The coherence score is 0.09. Is it fair to say that the model is dependent on the data?

I cannot come up with an explanation for that, as I used the same preparation (tokenize, remove stop words, keep the 2000 most frequent words, and build the vectors from the word frequencies).

The reason I needed to prepare the data myself is that the data you shared does not include the labels associated with the files. I need to generate the topics and have the associated labels so that I can pass them to a classifier to test the accuracy of the model.
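
To make the preparation steps above concrete, here is a minimal sketch of that pipeline that keeps each document's label aligned with its vector. It is not the script used for the paper or for the numbers in this issue: the function names, the tiny stop-word list, and the tie-breaking rule for equally frequent words are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the paper).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens only, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_vocab(token_docs, vocab_size=2000):
    """Map the `vocab_size` most frequent words to column indices."""
    counts = Counter(t for doc in token_docs for t in doc)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: i for i, (word, _) in enumerate(ranked[:vocab_size])}


def vectorize(tokens, vocab):
    """Build a raw term-frequency vector over the vocabulary."""
    vec = [0] * len(vocab)
    for t in tokens:
        idx = vocab.get(t)
        if idx is not None:  # words outside the vocabulary are ignored
            vec[idx] += 1
    return vec


def prepare(docs, labels, vocab_size=2000):
    """Return (vectors, labels, vocab) with vectors[i] matching labels[i]."""
    token_docs = [tokenize(d) for d in docs]
    vocab = build_vocab(token_docs, vocab_size)
    vectors = [vectorize(toks, vocab) for toks in token_docs]
    return vectors, list(labels), vocab


# Toy usage with made-up documents and labels.
docs = ["The cat sat on the mat", "A cat and a dog"]
labels = ["pets", "pets"]
vectors, kept_labels, vocab = prepare(docs, labels, vocab_size=3)
```

Small differences at any of these steps (the stop-word list, the tokenizer regex, whether stemming is applied, how ties at the vocabulary cutoff are broken) change which 2000 words survive, so they are worth comparing when two preparations give different coherence scores.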

I would appreciate any help with this.

I preprocessed the data as described here: https://github.com/hugochan/KATE/blob/master/construct_20news.py (without stemming and normalizing the data).

Or, if you can share the script you used to preprocess the data, that would also help.