@stephantul
In the original model, stemming was applied after the raw terms had already been indexed. This caused indexing errors whenever two different terms shared the same stem.

Before:

_, y = d.process_document("hello hellos")
print(y)
# {"hello": 2}

Now:

_, y = d.process_document("hello hellos")
print(y)
# {"hello": 1}

This slightly raises scores on NanoBEIR.
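
To illustrate the class of error, here is a minimal sketch of how the order of stemming and indexing matters when two terms share a stem. It is not the model's code and does not reproduce `process_document` or its output values: `toy_stem`, `index_then_stem`, and `stem_then_index` are hypothetical names, and the stemmer is a stand-in that only strips a trailing "s".

```python
from collections import Counter


def toy_stem(token: str) -> str:
    """Hypothetical stand-in for a real stemmer: strip one trailing 's'."""
    if len(token) > 1 and token.endswith("s"):
        return token[:-1]
    return token


def index_then_stem(text: str) -> dict[str, int]:
    """Index raw terms first, then stem the keys.

    Terms with the same stem collide, and the later entry
    silently overwrites the earlier one.
    """
    raw_counts = Counter(text.split())
    stemmed: dict[str, int] = {}
    for term, count in raw_counts.items():
        stemmed[toy_stem(term)] = count  # collision: previous value is lost
    return stemmed


def stem_then_index(text: str) -> dict[str, int]:
    """Stem every token first, then index, so shared stems are merged."""
    return dict(Counter(toy_stem(token) for token in text.split()))


text = "hello hello hellos"
print(index_then_stem(text))  # {'hello': 1}
print(stem_then_index(text))  # {'hello': 3}
```

In the first function the count for "hello" is overwritten by the count for "hellos", while stemming before indexing keeps a single consistent entry per stem.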

@stephantul stephantul changed the base branch from master to wandb July 15, 2025 07:58