TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

🎉 This paper has been accepted as a NeurIPS 2025 Spotlight 🌟 in the Datasets and Benchmarks (D&B) track.
Peking University, Huawei Noah's Ark Lab
TIME Dataset Overview

Introduction of TIME

Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing work neglects three real-world challenges for temporal reasoning: (1) intensive temporal information, (2) fast-changing event dynamics, and (3) complex temporal dependencies in social interactions. To bridge this gap, we propose TIME, a multi-level benchmark designed for temporal reasoning in real-world scenarios.

TIME consists of 38,522 QA pairs covering 3 levels and 11 fine-grained sub-tasks. The benchmark comprises 3 sub-datasets that reflect different real-world challenges: TIME-Wiki, TIME-News, and TIME-Dial. We run extensive experiments on both reasoning and non-reasoning models, analyze temporal reasoning performance in depth across diverse real-world scenarios and tasks, and summarize the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TIME-Lite, a human-annotated subset, to foster future research and standardized evaluation in temporal reasoning.

Total QA pairs: 38,522
TIME-Lite QA pairs: 943
Sub-datasets: 3
Sub-tasks: 11

Dataset Statistics

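| Dataset | All Tasks | Ext. | Loc. | Comp. | D.C. | O.C. | E.R. | O.R. | R.R. | C.T. | T.L. | C.F. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TIME | 38522 | 1480 | 3546 | 3376 | 3401 | 3549 | 3537 | 3538 | 3537 | 3513 | 5508 | 3537 |
| TIME-Wiki | 13848 | 1261 | 1299 | 1126 | 1151 | 1299 | 1287 | 1288 | 1287 | 1263 | 1300 | 1287 |
| TIME-News | 19958 | 0 | 1800 | 1800 | 1800 | 1800 | 1800 | 1800 | 1800 | 1800 | 3758 | 1800 |
| TIME-Dial | 4716 | 219 | 447 | 450 | 450 | 450 | 450 | 450 | 450 | 450 | 450 | 450 |
| TIME-Lite | 943 | 60 | 90 | 78 | 86 | 90 | 90 | 90 | 90 | 90 | 89 | 90 |
| TIME-Lite-Wiki | 322 | 30 | 30 | 24 | 28 | 30 | 30 | 30 | 30 | 30 | 30 | 30 |
| TIME-Lite-News | 299 | 0 | 30 | 30 | 30 | 30 | 30 | 30 | 30 | 30 | 29 | 30 |
| TIME-Lite-Dial | 322 | 30 | 30 | 24 | 28 | 30 | 30 | 30 | 30 | 30 | 30 | 30 |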

Task abbreviations: Ext. (Extract), Loc. (Localization), Comp. (Computation), D.C. (Duration Compare), O.C. (Order Compare); E.R. (Explicit Reasoning), O.R. (Order Reasoning), R.R. (Relative Reasoning); C.T. (Co-temporality), T.L. (Timeline), C.F. (Counterfactual).
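
The counts in the statistics table can be cross-checked with a few lines of Python. The sketch below copies the rows from the table and verifies two things: that each row's per-task counts sum to its "All Tasks" total, and that the sub-dataset rows add up, column by column, to the combined TIME and TIME-Lite rows. The names `STATS`, `row_is_consistent`, and `parts_sum_to_whole` are illustrative and are not part of the TIME codebase.

```python
# Column order of the per-task counts, as in the statistics table.
TASKS = ["Ext.", "Loc.", "Comp.", "D.C.", "O.C.", "E.R.",
         "O.R.", "R.R.", "C.T.", "T.L.", "C.F."]

# Rows copied from the table: name -> (All Tasks total, per-task counts).
STATS = {
    "TIME": (38522, [1480, 3546, 3376, 3401, 3549, 3537, 3538, 3537, 3513, 5508, 3537]),
    "TIME-Wiki": (13848, [1261, 1299, 1126, 1151, 1299, 1287, 1288, 1287, 1263, 1300, 1287]),
    "TIME-News": (19958, [0, 1800, 1800, 1800, 1800, 1800, 1800, 1800, 1800, 3758, 1800]),
    "TIME-Dial": (4716, [219, 447, 450, 450, 450, 450, 450, 450, 450, 450, 450]),
    "TIME-Lite": (943, [60, 90, 78, 86, 90, 90, 90, 90, 90, 89, 90]),
    "TIME-Lite-Wiki": (322, [30, 30, 24, 28, 30, 30, 30, 30, 30, 30, 30]),
    "TIME-Lite-News": (299, [0, 30, 30, 30, 30, 30, 30, 30, 30, 29, 30]),
    "TIME-Lite-Dial": (322, [30, 30, 24, 28, 30, 30, 30, 30, 30, 30, 30]),
}


def row_is_consistent(name):
    """Return True if the per-task counts of a row sum to its All Tasks total."""
    total, counts = STATS[name]
    return len(counts) == len(TASKS) and sum(counts) == total


def parts_sum_to_whole(whole, parts):
    """Return True if the part rows add up, column by column, to the whole row."""
    whole_counts = STATS[whole][1]
    summed = [sum(column) for column in zip(*(STATS[p][1] for p in parts))]
    return summed == whole_counts


if __name__ == "__main__":
    for name in STATS:
        print(name, row_is_consistent(name))
    print(parts_sum_to_whole("TIME", ["TIME-Wiki", "TIME-News", "TIME-Dial"]))
    print(parts_sum_to_whole("TIME-Lite", ["TIME-Lite-Wiki", "TIME-Lite-News", "TIME-Lite-Dial"]))
```

The same check can be rerun whenever the table is updated, so that a mistyped count shows up as a failed row or column sum.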


Construction Pipeline

TIME Construction Pipeline

Evaluation Results

TIME-Lite Results Radar Charts

TIME-Lite-Wiki

TIME-Lite-Wiki Results

TIME-Lite-News

TIME-Lite-News Results

TIME-Lite-Dial

TIME-Lite-Dial Results

Get Started

📥 Download Dataset

Option 1: Complete TIME Dataset

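# Install git-lfs
pip install git-lfs
# Note: if the download script calls the `git lfs` command, install the Git LFS
# extension itself (e.g. via your system package manager) and run `git lfs install` once.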

# From the repository root, make the TIME download script executable
chmod +x scripts/download_data_time.sh

# Download the data
./scripts/download_data_time.sh

Option 2: TIME-Lite Dataset (Recommended)

# From the repository root, make the TIME-Lite download script executable
chmod +x scripts/download_data_time_lite.sh

# Download the data
./scripts/download_data_time_lite.sh

🔧 Installation

# Install dependencies
pip install -r evaluation/requirements.txt

🚀 Evaluation

# Evaluate on the full TIME dataset
./scripts/eval_time.sh

# Evaluate on the TIME-Lite dataset (Recommended)
./scripts/eval_time_lite.sh
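
The evaluation scripts define the benchmark's actual metrics and output format. As a rough illustration of how per-sub-task scores could be aggregated from model predictions, the sketch below computes exact-match accuracy for each task. The record fields (`task`, `prediction`, `answer`), the normalization step, and the function names are assumptions made for this example and are not taken from the TIME codebase; the records themselves are toy data, not benchmark items.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def accuracy_by_task(records):
    """Return {task: exact-match accuracy} for a list of prediction records.

    Each record is assumed (hypothetically) to be a dict with the keys
    "task", "prediction", and "answer".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        task = record["task"]
        total[task] += 1
        if normalize(record["prediction"]) == normalize(record["answer"]):
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


# Toy records for illustration only.
example_records = [
    {"task": "Ext.", "prediction": "March 2020", "answer": "march 2020"},
    {"task": "Ext.", "prediction": "2019", "answer": "2021"},
    {"task": "O.C.", "prediction": "Before", "answer": "before"},
]

if __name__ == "__main__":
    for task, score in accuracy_by_task(example_records).items():
        print(f"{task}: {score:.2f}")
```

Exact match is only one possible scoring rule; free-form answers are often scored with softer measures, so treat the scores reported by the provided scripts as authoritative.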

Citation

If you find our work interesting and meaningful, please star this repo, upvote our HF repo TIME, and cite our paper as follows.

@article{wei2025time,
  title={TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios},
  author={Wei, Shaohang and Li, Wei and Song, Feifan and Luo, Wen and Zhuang, Tianyi and Tan, Haochen and Guo, Zhijiang and Wang, Houfeng},
  journal={arXiv preprint arXiv:2505.12891},
  year={2025}
}