🗃️ NLPre-ZH Dataset
Test datasets
The NLPre-ZH benchmark covers a set of linguistic tasks (segmentation, lemmatization, morphological analysis, part-of-speech tagging, and dependency parsing) together with manually annotated test datasets for evaluating NLP models on these tasks.
NLPre-ZH uses the Traditional Chinese UD treebank, UD_Chinese-GSD, to evaluate NLPre tasks. UD_Chinese-GSD was annotated and converted by Google and contains 4997 sentences, split as follows:
- test: 500 trees
- dev: 500 trees
- train: 3997 trees
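To check a downloaded copy of the treebank against the split sizes listed above, one option is to count the trees in each CoNLL-U file. The sketch below is a minimal illustration, not part of the NLPre-ZH tooling: the helper name `count_trees` and the tiny `SAMPLE` string are hypothetical, and it only assumes the standard CoNLL-U layout (comment lines start with `#`, sentences are separated by blank lines).

```python
# Expected split sizes, taken from the list above.
EXPECTED_SPLITS = {"train": 3997, "dev": 500, "test": 500}


def count_trees(conllu_text: str) -> int:
    """Count sentences (trees) in CoNLL-U text.

    A tree is a block of token lines; blocks are separated by blank lines,
    and lines starting with '#' are sentence-level comments.
    """
    count = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        stripped = line.strip()
        if not stripped:
            # A blank line closes the current sentence block.
            in_sentence = False
        elif not stripped.startswith("#") and not in_sentence:
            # First token line of a new sentence block.
            count += 1
            in_sentence = True
    return count


# Hypothetical two-sentence sample in CoNLL-U layout (illustration only).
SAMPLE = (
    "# sent_id = 1\n"
    "1\t你好\t你好\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "1\t謝謝\t謝謝\tVERB\t_\t_\t0\troot\t_\t_\n"
    "2\t。\t。\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    print(count_trees(SAMPLE))            # trees in the sample
    print(sum(EXPECTED_SPLITS.values()))  # total sentences across splits
```

To apply this to a real split, read the corresponding `.conllu` file as UTF-8 text, pass its contents to `count_trees`, and compare the result with the matching entry in `EXPECTED_SPLITS`.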
Test textual data
Download the zip file with the textual data to be processed.