
Different preparation, very different result! #18

@benam2

Description

Hi,

Thanks for sharing the code. I'd like to raise a new issue here. When I use the dataset you shared, I get almost the same coherence score. However, when I use the original 20 Newsgroups dataset and follow the same preparation described in the paper, I get a very different result.

I would really appreciate it if you could share your thoughts. I can share the preparation script I used.

The coherence score is 0.09. Is it fair to say that the model is dependent on the data?

I cannot come up with an explanation for that, as I used the same preparation (tokenize, remove stop words, keep the 2000 most frequent words, and build the vectors from the word frequencies).

The reason I needed to prepare the data myself is that the data you shared does not include the labels associated with the files. I need to generate the topics and keep the associated labels so that I can pass them to a classifier to test the accuracy of the model.
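
The preparation steps listed above, with labels kept aligned to documents, can be sketched roughly as follows. This is only a minimal illustration under assumptions, not the script actually used for the paper or for this issue: the stop-word list is a tiny placeholder, the regex tokenizer is a guess, and the names `tokenize` and `build_bow` are hypothetical.

```python
import re
from collections import Counter

# Tiny placeholder stop-word list (assumption); a real run would use a fuller list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "for"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens only, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_bow(docs, labels, vocab_size=2000):
    """Return (vocab, count vectors, labels) with labels in document order."""
    tokenized = [tokenize(d) for d in docs]

    # Corpus-wide term frequencies.
    counts = Counter(t for doc in tokenized for t in doc)

    # Keep the most frequent words; ties are broken alphabetically so the
    # vocabulary is deterministic across runs.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [word for word, _ in ranked[:vocab_size]]
    index = {word: i for i, word in enumerate(vocab)}

    # One frequency vector per document, in the same order as `labels`.
    vectors = []
    for doc in tokenized:
        vec = [0] * len(vocab)
        for t in doc:
            if t in index:
                vec[index[t]] += 1
        vectors.append(vec)

    return vocab, vectors, list(labels)


if __name__ == "__main__":
    docs = ["The cat sat on the mat", "A cat and a dog"]
    vocab, vectors, labels = build_bow(docs, ["rec.pets", "rec.pets"], vocab_size=3)
    print(vocab, vectors, labels)
```

Details such as the stop-word list, the tokenizer, and how ties are broken at the 2000-word cutoff all change the resulting vocabulary, so they may be worth comparing against the shared dataset.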

I would appreciate any help with this.

I preprocessed the data as described here: https://github.com/hugochan/KATE/blob/master/construct_20news.py (without stemming and normalizing the data).

Alternatively, if you could share the script you used to preprocess the data, that would also help.
