Script for dividing corpus into different parts #1

@nikopartanen

I need a script that takes a UD corpus and splits it into parts. Ideally the split could be specified either as proportions, e.g. 10/90, or as sentence counts, e.g. 100/900. Other ideas are welcome too. I assumed something like this already exists, since the corpora are split at some point, but I didn't find anything. I don't know what the typical way to do the split is, but one useful option would be to specify whether sentences are selected randomly or taken consecutively from the top.
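To make the request concrete, here is a minimal sketch of what such a script could do, assuming the corpus is in CoNLL-U format with sentences separated by blank lines. The function names `read_sentences` and `split_sentences` and their parameters are hypothetical, not part of any existing tool in this repository.

```python
import random


def read_sentences(text):
    """Return the sentence blocks of a CoNLL-U string (blank-line separated)."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            blocks.append("\n".join(current))
            current = []
    if current:  # file did not end with a blank line
        blocks.append("\n".join(current))
    return blocks


def split_sentences(sentences, first, shuffle=False, seed=0):
    """Split sentences into two parts.

    first   -- a float (proportion of the corpus, e.g. 0.1) or an int
               (number of sentences, e.g. 100) for the first part
    shuffle -- if True, select sentences randomly; otherwise take them
               consecutively from the top
    seed    -- seed for the random selection, so splits are reproducible
    """
    items = list(sentences)
    if shuffle:
        random.Random(seed).shuffle(items)
    if isinstance(first, float):
        n = round(len(items) * first)
    else:
        n = first
    n = max(0, min(n, len(items)))  # clamp to the corpus size
    return items[:n], items[n:]
```

With this sketch, `split_sentences(sents, 0.1)` would give a 10/90 split from the top, and `split_sentences(sents, 100, shuffle=True)` would pick 100 random sentences for the first part.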

Just to give some background: I would use it within a loop that creates subsets of different sizes from the corpus, after one portion has been extracted as a fixed test set. I'm not totally sure how to make this most generalizable for different uses, but it sounds like a generally useful idea anyway, so I'm adding it here.
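The loop described above could look roughly like the following sketch. It assumes the sentences are already loaded as a list of strings; `held_out_and_subsets` is a hypothetical name, and making the training subsets nested (each smaller subset is a prefix of the larger ones) is one possible design choice, not a requirement.

```python
import random


def held_out_and_subsets(sentences, test_size, subset_sizes, seed=0):
    """Hold out a fixed test portion, then build training subsets of several sizes.

    Returns (test, subsets), where subsets maps each requested size to a list
    of sentences drawn from the remaining pool. Sizes larger than the pool
    are skipped.
    """
    items = list(sentences)
    random.Random(seed).shuffle(items)  # same seed gives the same test portion
    test, pool = items[:test_size], items[test_size:]
    subsets = {n: pool[:n] for n in subset_sizes if n <= len(pool)}
    return test, subsets
```

Because the test portion is taken before any subset is built, no training subset can overlap with it, whatever sizes are requested.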

Thanks a lot for many useful scripts here!
