ACE 2005 Dataset (Introduction, Part 2)


The following content is taken from https://catalog.ldc.upenn.edu/LDC2006T06

ACE 2005 Multilingual Training Corpus

Item Name: ACE 2005 Multilingual Training Corpus
Author(s): Christopher Walker, Stephanie Strassel, Julie Medero, Kazuaki Maeda
LDC Catalog No.: LDC2006T06
ISBN: 1-58563-376-3
ISLRN: 458-031-085-383-4
Release Date: February 15, 2006
Member Year(s): 2006
DCMI Type(s): Text
Data Source(s): weblogs, broadcast news, newsgroups, broadcast conversation
Project(s): ACE
Application(s): automatic content extraction
Language(s): Mandarin Chinese, Standard Arabic, English
Language ID(s): cmn, arb, eng
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2006T06 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Walker, Christopher, et al. ACE 2005 Multilingual Training Corpus LDC2006T06. Web Download. Philadelphia: Linguistic Data Consortium, 2006.

Introduction

ACE 2005 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities, relations and events by the Linguistic Data Consortium (LDC) with support from the ACE Program and additional assistance from LDC.

The objective of the ACE program was to develop automatic content extraction technology to support automatic processing of human language in text form.

In November 2005, sites were evaluated on system performance in five primary areas: the recognition of entities, values, temporal expressions, relations, and events. Entity, relation and event mention detection were also offered as diagnostic tasks. All tasks except the event tasks were performed for three languages: English, Chinese and Arabic. Event tasks were evaluated in English and Chinese only. This release comprises the official training data for these evaluation tasks.

For more information about linguistic resources for the ACE Program, including annotation guidelines, task definitions and other documentation, see LDC's ACE website.

Data

Below is information about the amount of data in this release and its annotation status.

1P: data subject to first pass (complete) annotation
DUAL: data also subject to dual first pass (complete) annotation
ADJ: data also subject to discrepancy resolution/adjudication
NORM: data also subject to TIMEX2 normalization

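--------------------- Explanation of 1P, DUAL, ADJ and NORM (source: https://blog.csdn.net/taolusi/article/details/80812597, author: taolusi)

The adj, fp1, fp2 and timex2norm folders correspond to different stages of the annotation process. All ACE data was annotated for every task by two annotators working independently. The first annotation pass is called 1P, and the independent dual first pass is called DUAL. In both 1P and DUAL, a single annotator completes all tasks for a given file. Files are assigned through the Annotation Workflow System (AWS), and the assignment is double-blind. Note: 1P and DUAL are stored in the folders 'fp1' and 'fp2', that is, 1P corresponds to fp1 and DUAL corresponds to fp2. Discrepancies between the 1P and DUAL versions of each file are adjudicated by a senior annotator or team leader, which yields a high-quality gold standard file. These adjudicated gold standard files are called ADJ (the adj folder mentioned above). After adjudication, TIMEX2 values are normalized to produce NORM. According to the source, all datasets in this corpus have been NORM-annotated. ---------------------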
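The correspondence between annotation stages and folders described above can be written down as a small lookup. This is only a sketch: the mapping of NORM to the timex2norm folder is inferred from the folder names in the quoted explanation, and the corpus path and the helper name `stage_dir` are hypothetical, so the layout of an actual local copy may differ.

```python
from pathlib import Path

# Stage-to-folder mapping taken from the explanation above.
# 1P -> fp1 and DUAL -> fp2 are stated explicitly; NORM -> timex2norm
# is an inference from the folder names.
STAGE_FOLDERS = {
    "1P": "fp1",
    "DUAL": "fp2",
    "ADJ": "adj",
    "NORM": "timex2norm",
}


def stage_dir(genre_dir, stage):
    """Return the folder holding one annotation stage under a genre directory."""
    return Path(genre_dir) / STAGE_FOLDERS[stage.upper()]


# Hypothetical path, used only for illustration.
print(stage_dir("ace_2005/English/nw", "ADJ"))
```

An unknown stage name raises a KeyError, which is usually preferable to silently reading from the wrong folder.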

English

         words                               files
         1P      DUAL    ADJ     NORM        1P    DUAL  ADJ   NORM
NW       60658   57807   33459   48399       128   124   81    106
BN       59239   58144   52444   55967       239   234   217   226
BC       46612   46110   33874   40415       68    67    52    60
WL       45210   43648   35529   37897       127   122   114   119
UN       45161   44473   26371   37366       58    57    37    49
CTS      47003   47003   34868   39845       46    46    34    39
Total    303833  297185  216545  259889      666   650   535   599
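Because the English figures are easy to misread, one quick sanity check is to add up the per-genre file counts and compare them with the reported Total row. The sketch below does this for the four file-count columns only; the variable names are illustrative and the numbers are copied from the English table.

```python
# English file counts per genre, in column order 1P, DUAL, ADJ, NORM.
english_files = {
    "NW": (128, 124, 81, 106),
    "BN": (239, 234, 217, 226),
    "BC": (68, 67, 52, 60),
    "WL": (127, 122, 114, 119),
    "UN": (58, 57, 37, 49),
    "CTS": (46, 46, 34, 39),
}
reported_total = (666, 650, 535, 599)

# zip(*rows) turns the per-genre rows into per-stage columns.
computed_total = tuple(sum(column) for column in zip(*english_files.values()))

print(computed_total == reported_total)
```

The same pattern can be applied to the word and character columns of the other tables.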

Chinese

Note: Chinese data is expressed in terms of characters. We assume a correspondence of roughly 1.5 characters per word.
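To compare the Chinese figures with the word counts given for English and Arabic, the 1.5 characters-per-word ratio from the note can be applied directly. The sketch below is a rough estimate rather than a real word count; the function name `approx_words` is illustrative, and the character counts are the 1P figures from the Chinese table.

```python
CHARS_PER_WORD = 1.5  # rough ratio stated in the note above


def approx_words(chars, chars_per_word=CHARS_PER_WORD):
    """Estimate a word count from a Chinese character count."""
    return round(chars / chars_per_word)


# 1P character counts per genre, taken from the Chinese table.
chinese_1p_chars = {"NW": 127319, "BN": 134963, "WL": 71839}

for genre, chars in chinese_1p_chars.items():
    print(genre, approx_words(chars))
```

Since the ratio is itself an approximation, the results are best treated as order-of-magnitude estimates.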

                         chars                       files
                         1P      DUAL    ADJ         1P    DUAL  ADJ
NW (newswire)            127319  124175  121797      248   242   238
BN (broadcast news)      134963  133696  120513      332   328   298
WL (weblog)              71839   68063   65681       107   101   97
Total                    334121  325834  307991      687   671   633

Arabic

         words                       files
         1P      DUAL    ADJ         1P    DUAL  ADJ
NW       61287   56158   53026       239   226   221
BN       29259   27165   26907       134   128   127
WL       21687   20181   20181       60    55    55
Total    112233  103504  100114      433   409   403

Samples

For examples of the data in this publication, please review the following samples:

English, Arabic, Chinese

