The MAssive eVENt detection (MAVEN) dataset is a large-scale ED dataset developed in 2020 by combining machine generation and human annotation over 4,480 Wikipedia documents. It aims to address two limitations of existing ED datasets: data scarcity and low coverage of event types. The event types in MAVEN are derived from the frames defined in the linguistic resource FrameNet, which gives broad coverage of events in the general domain. MAVEN covers 168 event types, 111,611 distinct events, and 118,732 event mentions, a larger data scale and broader event coverage than existing datasets. It has recently been used to develop a range of ED models.

Roles Across Multiple Sentences (RAMS) is a crowdsourced dataset developed for multi-sentence argument linking, the task of identifying the explicit arguments that fill the different roles of an event across multiple sentences. It covers 139 event types, 65 roles, and 9,124 annotated events from 3,993 news articles. Because it is larger than prior datasets for cross-sentence argument linking, RAMS supports the development of deep learning models for this task.
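To make the notion of multi-sentence argument linking concrete, the following minimal sketch represents one event whose arguments lie in different sentences from its trigger. The class names, field names, role labels, and the toy annotation are all hypothetical illustrations; they are not the RAMS file format or its ontology.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Span:
    """A token span inside one sentence (hypothetical representation)."""
    sentence: int  # index of the sentence in the document
    start: int     # index of the first token
    end: int       # index of the last token (inclusive)


@dataclass
class Event:
    """An event trigger plus its role-to-argument links."""
    event_type: str
    trigger: Span
    arguments: Dict[str, Span] = field(default_factory=dict)


def cross_sentence_roles(event: Event) -> List[str]:
    """Return the roles whose argument is outside the trigger's sentence."""
    return sorted(
        role
        for role, span in event.arguments.items()
        if span.sentence != event.trigger.sentence
    )


# Toy document, invented for illustration only.
sentences = [
    ["Rebels", "entered", "the", "city", "on", "Monday", "."],
    ["The", "bombing", "lasted", "hours", "."],
]

event = Event(
    event_type="attack",             # hypothetical type label
    trigger=Span(1, 1, 1),           # "bombing"
    arguments={
        "attacker": Span(0, 0, 0),   # "Rebels", previous sentence
        "place": Span(0, 2, 3),      # "the city", previous sentence
        "duration": Span(1, 3, 3),   # "hours", same sentence as trigger
    },
)

linked = cross_sentence_roles(event)
```

In this toy case, two of the three arguments can only be recovered by looking beyond the sentence that contains the trigger, which is the situation a sentence-level extractor would miss.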

Automatically Labeled Data Generation for large scale event extraction (ALDG) is a dataset generated automatically through distant supervision, jointly using world knowledge from Freebase and linguistic knowledge from FrameNet. ALDG covers 72,611 events and 21 event types across 6.3 million Wikipedia articles, with a focus on education, military, and sports topics.

WIKIEVENTS is an English benchmark dataset built in 2021 for the document-level event extraction task. It provides complete event and related annotations on 246 documents. The event annotation followed the established ontology of the KAIROS project. The training, development, and test sets contain 49, 35, and 34 event types and 5,262, 378, and 492 annotated sentences, respectively. Compared to ACE, the WIKIEVENTS dataset has a much richer event ontology, especially for argument roles.

CASIE is an event extraction dataset for detecting cybersecurity events. It consists of 1,000 English news articles from 2017 to 2019 with event-based annotations that cover both cyberattack and vulnerability-related events. There are 5 event types: 3 for cyberattacks (Databreach, Ransom, and Phishing) and 2 for vulnerabilities (Discover Vulnerability and Patch Vulnerability).
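The two-level grouping of CASIE event types described above can be written down as a small lookup table. This is only an illustrative sketch: the dictionary layout and the helper name `category_of` are assumptions, not part of the CASIE release, and the label strings simply follow the names given in the text.

```python
# Hypothetical encoding of the CASIE event types listed in the text.
CASIE_EVENT_TYPES = {
    "cyberattack": ["Databreach", "Ransom", "Phishing"],
    "vulnerability": ["Discover Vulnerability", "Patch Vulnerability"],
}


def category_of(event_type: str) -> str:
    """Return the top-level category that an event type belongs to."""
    for category, types in CASIE_EVENT_TYPES.items():
        if event_type in types:
            return category
    raise KeyError(f"unknown event type: {event_type!r}")


# Total number of event types across both categories.
num_event_types = sum(len(types) for types in CASIE_EVENT_TYPES.values())
```

A table of this kind is convenient when reporting results per category (cyberattack versus vulnerability) in addition to per event type.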

The commodity news corpus is a collection of annotated commodity news articles covering 21 entity types, 18 event types, and 19 argument role types. The corpus contains 8,850 entity mentions and 3,949 events.

LitBank is an annotated dataset of 100 works of English-language fiction to support tasks in natural language processing and the computational humanities. LitBank currently contains annotations for entities, events (7,849 events), entity coreference, and quotation attribution in a sample of ~2,000 words from each of those texts, totaling 210,532 tokens.