TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

🎉🎉 Congratulations! This paper has been accepted as a NeurIPS 2025 Spotlight 🌟🔥 in the Datasets and Benchmarks (D&B) track.

Project Leads: Shaohang Wei (shaohang@stu.pku.edu.cn)

Paper | Code | 🤗 TIME Dataset | ⚡ TIME-Lite

Introduction of TIME

Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing work neglects three real-world challenges for temporal reasoning: (1) intensive temporal information, (2) fast-changing event dynamics, and (3) complex temporal dependencies in social interactions. To bridge this gap, we propose TIME, a multi-level benchmark designed for temporal reasoning in real-world scenarios.

TIME consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. The benchmark encompasses 3 sub-datasets reflecting different real-world challenges: TIME-Wiki, TIME-News, and TIME-Dial. We conduct extensive experiments on reasoning and non-reasoning models, analyze temporal reasoning performance in depth across diverse real-world scenarios and tasks, and summarize the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TIME-Lite, a human-annotated subset, to foster future research and standardized evaluation in temporal reasoning.

Total QA Pairs: 38,522 | TIME-Lite QA Pairs: 943 | Sub-datasets: 3 | Sub-tasks: 11

Dataset Statistics

Dataset	All Tasks	Ext.	Loc.	Comp.	D.C.	O.C.	E.R.	O.R.	R.R.	C.T.	T.L.	C.F.
TIME	38522	1480	3546	3376	3401	3549	3537	3538	3537	3513	5508	3537
TIME-Wiki	13848	1261	1299	1126	1151	1299	1287	1288	1287	1263	1300	1287
TIME-News	19958	0	1800	1800	1800	1800	1800	1800	1800	1800	3758	1800
TIME-Dial	4716	219	447	450	450	450	450	450	450	450	450	450
TIME-Lite	943	60	90	78	86	90	90	90	90	90	89	90
TIME-Lite-Wiki	322	30	30	24	28	30	30	30	30	30	30	30
TIME-Lite-News	299	0	30	30	30	30	30	30	30	30	29	30
TIME-Lite-Dial	322	30	30	24	28	30	30	30	30	30	30	30

Task abbreviations: Ext. (Extract), Loc. (Localization), Comp. (Computation), D.C. (Duration Compare), O.C. (Order Compare); E.R. (Explicit Reasoning), O.R. (Order Reasoning), R.R. (Relative Reasoning); C.T. (Co-temporality), T.L. (Timeline), C.F. (Counterfactual).
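
As a quick sanity check on the table above, the per-task counts of the three sub-datasets should add up to the TIME row and to each row's All Tasks value. The Python sketch below hard-codes the numbers from the table. The per-level grouping is an assumption: it reads the semicolon-separated groups in the abbreviation list as Levels 1 to 3.

```python
# Per-task QA counts copied from the Dataset Statistics table.
TASKS = ["Ext.", "Loc.", "Comp.", "D.C.", "O.C.",
         "E.R.", "O.R.", "R.R.", "C.T.", "T.L.", "C.F."]

COUNTS = {
    "TIME-Wiki": [1261, 1299, 1126, 1151, 1299, 1287, 1288, 1287, 1263, 1300, 1287],
    "TIME-News": [0, 1800, 1800, 1800, 1800, 1800, 1800, 1800, 1800, 3758, 1800],
    "TIME-Dial": [219, 447, 450, 450, 450, 450, 450, 450, 450, 450, 450],
}

# Values reported in the "All Tasks" column and in the aggregate TIME row.
REPORTED_ALL_TASKS = {"TIME-Wiki": 13848, "TIME-News": 19958, "TIME-Dial": 4716}
REPORTED_TIME_ROW = [1480, 3546, 3376, 3401, 3549, 3537, 3538, 3537, 3513, 5508, 3537]
REPORTED_TIME_TOTAL = 38522

# Assumption: the semicolon-separated groups in the abbreviation list are Levels 1-3.
LEVELS = {"Level 1": slice(0, 5), "Level 2": slice(5, 8), "Level 3": slice(8, 11)}


def column_totals(counts):
    """Sum each task column across the sub-datasets."""
    return [sum(column) for column in zip(*counts.values())]


def level_totals(row):
    """Sum a row of per-task counts within each assumed level."""
    return {level: sum(row[span]) for level, span in LEVELS.items()}


if __name__ == "__main__":
    # Each sub-dataset row must match its reported "All Tasks" value.
    for name, row in COUNTS.items():
        assert sum(row) == REPORTED_ALL_TASKS[name], name

    # The column sums must reproduce the aggregate TIME row and its total.
    totals = column_totals(COUNTS)
    assert totals == REPORTED_TIME_ROW
    assert sum(totals) == REPORTED_TIME_TOTAL

    for task, n in zip(TASKS, totals):
        print(f"{task:<6}{n:>6}")
    print(level_totals(totals))
```

The same check can be repeated for the TIME-Lite rows by swapping in their counts.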

Construction Pipeline

Evaluation Results

TIME-Lite Results Radar Charts (panels: TIME-Lite-Wiki, TIME-Lite-News, TIME-Lite-Dial)

Get Started

📥 Download Dataset

Option 1: Complete TIME Dataset

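# Install git-lfs
# (if the download script calls `git lfs`, install the official Git LFS client
# through your system package manager and run `git lfs install` instead)
pip install git-lfs

# From the repository root, make the TIME download script executable
chmod +x scripts/download_data_time.sh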

# Download the data
./scripts/download_data_time.sh

Option 2: TIME-Lite Dataset (Recommended)

# From the repository root, make the TIME-Lite download script executable
chmod +x scripts/download_data_time_lite.sh

# Download the data
./scripts/download_data_time_lite.sh

🔧 Installation

# Install dependencies
pip install -r evaluation/requirements.txt

🚀 Evaluation

# Evaluate TIME dataset
./scripts/eval_time.sh

# Evaluate TIME-Lite dataset (Recommended)
./scripts/eval_time_lite.sh
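
Because the sub-tasks differ in size (for example, 24 versus 30 questions in some TIME-Lite sub-tasks), per-task scores are usually easier to compare than a single pooled number. The Python sketch below aggregates per-task accuracy from a list of result records. The record keys `task` and `correct` are hypothetical: the output format of the evaluation scripts is not described here, and the metrics they report may differ from plain accuracy.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return {task: accuracy} for an iterable of result records.

    Each record is a dict with two hypothetical keys:
      "task"    -- sub-task abbreviation, e.g. "Ext." or "T.L."
      "correct" -- truthy if the model's answer was judged correct
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        task = record["task"]
        totals[task] += 1
        hits[task] += int(bool(record["correct"]))
    return {task: hits[task] / totals[task] for task in totals}


def macro_average(per_task):
    """Unweighted mean over tasks, so small tasks count as much as large ones."""
    return sum(per_task.values()) / len(per_task) if per_task else 0.0


if __name__ == "__main__":
    # Toy records for illustration only; these are not real benchmark results.
    sample = [
        {"task": "Ext.", "correct": True},
        {"task": "Ext.", "correct": False},
        {"task": "T.L.", "correct": True},
    ]
    scores = accuracy_by_task(sample)
    print(scores)
    print(macro_average(scores))
```

Macro averaging is one design choice among several; a pooled (micro) average would instead weight each question equally.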

Citation

If you find our work interesting and meaningful, please consider starring this repo, upvoting our HF repo TIME, and citing our paper as follows.

@article{wei2025time,
  title={TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios},
  author={Wei, Shaohang and Li, Wei and Song, Feifan and Luo, Wen and Zhuang, Tianyi and Tan, Haochen and Guo, Zhijiang and Wang, Houfeng},
  journal={arXiv preprint arXiv:2505.12891},
  year={2025}
}

This benchmark is designed to advance temporal reasoning capabilities in Large Language Models.

For questions or collaboration, please contact: shaohang@stu.pku.edu.cn