🗃️ NLPre-ZH Dataset


Test datasets


The NLPre-ZH benchmark covers a set of linguistic tasks (segmentation, lemmatization, morphological analysis, part-of-speech tagging, and dependency parsing) together with a collection of manually annotated test datasets selected for evaluating NLP models on these tasks.


NLPre-ZH uses the Traditional Chinese UD treebank, UD_Chinese-GSD, to evaluate NLPre tasks. UD_Chinese-GSD was annotated and converted by Google and contains 4997 sentences, split as follows:

  • test: 500 trees
  • dev: 500 trees
  • train: 3997 trees
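
The split sizes listed above can be checked against a local copy of the treebank. The following is a minimal sketch, assuming the files are in the standard CoNLL-U format used by UD treebanks (sentences separated by blank lines, comment lines starting with "#"). The helper name count_trees and the two sample sentences are made up for illustration and are not part of NLPre-ZH.

```python
from typing import Iterable

# Split sizes as stated in this document.
EXPECTED_SPLIT = {"train": 3997, "dev": 500, "test": 500}


def count_trees(lines: Iterable[str]) -> int:
    """Count sentences (trees) in CoNLL-U formatted lines.

    A tree is counted at its first token line; a blank line closes it.
    """
    count = 0
    in_sentence = False
    for line in lines:
        stripped = line.strip()
        if not stripped:
            # A blank line ends the current sentence, if one is open.
            in_sentence = False
        elif stripped.startswith("#"):
            # Sentence-level comment (e.g. "# sent_id = ..."); not a token.
            continue
        elif not in_sentence:
            count += 1
            in_sentence = True
    return count


def token(idx: int, form: str, upos: str, head: int, deprel: str) -> str:
    """Build one 10-column CoNLL-U token line (illustrative values only)."""
    cols = [str(idx), form, form, upos, "_", "_", str(head), deprel, "_", "_"]
    return "\t".join(cols)


# Two invented sentences, used only to exercise the counter.
SAMPLE_LINES = [
    "# sent_id = demo-1",
    token(1, "我", "PRON", 2, "nsubj"),
    token(2, "愛", "VERB", 0, "root"),
    token(3, "你", "PRON", 2, "obj"),
    "",
    "# sent_id = demo-2",
    token(1, "他", "PRON", 2, "nsubj"),
    token(2, "來", "VERB", 0, "root"),
    "",
]

sample_count = count_trees(SAMPLE_LINES)
total_expected = sum(EXPECTED_SPLIT.values())
```

To apply the counter to a real file, pass an open file handle, for example `count_trees(open(path, encoding="utf-8"))`, where `path` points to whichever CoNLL-U file of the split you have locally, and compare the result with the matching entry in EXPECTED_SPLIT.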

Test textual data


Download the zip file with the textual data to be processed.