🗃️ NLPre-ZH Dataset

Test datasets

The NLPre-ZH benchmark comprises a set of linguistic tasks (segmentation, lemmatization, morphological analysis, part-of-speech tagging, and dependency parsing) and a collection of manually annotated test datasets selected for evaluating NLP models that perform these tasks.

NLPre-ZH uses the Traditional Chinese UD treebank, UD_Chinese-GSD, to evaluate NLPre tasks. UD_Chinese-GSD was annotated and converted by Google and contains 4997 sentences, split as follows:

test: 500 trees
dev: 500 trees
train: 3997 trees
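
The split sizes listed above can be sanity-checked after downloading the treebank. The following is a minimal sketch, assuming the data is distributed in the CoNLL-U format used by UD treebanks, where sentences are separated by blank lines and comment lines start with `#`. The function name `count_trees`, the `EXPECTED_TREES` table, and the command-line usage are illustrative assumptions, not part of NLPre-ZH.

```python
import sys
from typing import Iterable

# Split sizes as stated in this document (4997 sentences in total).
EXPECTED_TREES = {"test": 500, "dev": 500, "train": 3997}


def count_trees(lines: Iterable[str]) -> int:
    """Count sentences (trees) in an iterable of CoNLL-U lines.

    A sentence is a run of non-comment lines terminated by a blank line
    or by the end of the input.
    """
    count = 0
    in_sentence = False
    for line in lines:
        stripped = line.strip()
        if not stripped:
            # A blank line closes the current sentence, if one is open.
            if in_sentence:
                count += 1
            in_sentence = False
        elif not stripped.startswith("#"):
            # Token lines start with an ID, never with '#'.
            in_sentence = True
    if in_sentence:
        # The last sentence may lack a trailing blank line.
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical usage: python count_trees.py path/to/file.conllu ...
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            print(path, count_trees(handle))
```

Comparing the printed counts against `EXPECTED_TREES` is a quick way to confirm that the downloaded files match the splits described here.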

Test textual data

Download the zip file with the textual data to be processed.