Beyond the Individual: Introducing Group Intention Forecasting with SHOT Dataset

1School of Computer Science, Wuhan University
2Peking University
3School of Computing, National University of Singapore
4Wuhan University of Technology
Technical Report
SHOT Dataset Overview
Overview. (a) The Group Intention Forecasting task forecasts when group intentions will occur by observing individual actions and interactions in the early frames; (b) the SHOT dataset provides videos from 5 camera views and is annotated with 6 multi-individual attributes that describe multi-level intention, covering both group intention and individual intention.

Abstract

Intention recognition has traditionally focused on individual intentions, overlooking the complexities of collective intentions in group settings. To address this limitation, we introduce the concept of group intention, which represents shared goals emerging through the actions of multiple individuals, and Group Intention Forecasting (GIF), a novel task that forecasts when group intentions will occur by analyzing individual actions and interactions before the collective goal becomes apparent. To investigate GIF in a specific scenario, we propose SHOT, the first large-scale dataset for GIF, consisting of 1,979 basketball video clips captured from 5 camera views and annotated with 6 types of individual attributes. SHOT is designed with 3 key characteristics: multi-individual information, multi-view adaptability, and multi-level intention, making it well-suited for studying emerging group intentions. Furthermore, we introduce GIFT (Group Intention ForecasTer), a framework that extracts fine-grained individual features and models evolving group dynamics to forecast intention emergence. Experimental results confirm the effectiveness of SHOT and GIFT, establishing a strong foundation for future research in group intention forecasting.

Dataset Pipeline

Dataset Pipeline
Pipeline Overview. Collection: videos are sourced from NBA highlights and full-game replays, then compiled into an unlabeled pool. Categorization: clips are classified by camera view and tactical type. Annotation: features are labeled manually or via tracking models. Structure: video annotations are stored in a JSON file whose structure is shown in the figure (a minimal sketch follows). Review: annotations are reviewed and relabeled as needed.
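The released JSON schema is the one shown in the pipeline figure; purely as an illustration, one clip's annotations might be organized along these lines (all field names and values here are hypothetical assumptions, not the actual SHOT schema):

```python
import json

# Hypothetical layout of one clip's annotations; field names are illustrative
# and do not necessarily match the released SHOT JSON schema.
clip_annotation = {
    "clip_id": "example_clip_0001",
    "camera_view": 3,                      # one of the 5 views
    "tactic": {                            # the four tactical dimensions
        "pass_frequency": "Multi-Pass",
        "pick_and_roll": "One-P&R",
        "drive": "Drive",
        "shot_type": "Layup",
    },
    "shot_frame": 87,                      # frame at which the group intention (the shot) occurs
    "players": [                           # per-frame attributes for each of the 10 players
        {
            "player_id": 0,
            "role": "shooter",
            "frames": [
                {"frame": 0, "bbox": [412, 233, 60, 140],
                 "pose": "...", "gaze": "...", "headpose": "...", "velocity": "..."},
            ],
        },
    ],
}

with open("example_annotation.json", "w") as f:
    json.dump(clip_annotation, f, indent=2)
```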
Dataset Comparison
Comparison. The proposed SHOT dataset compared with existing datasets. SA: Sports Analysis, II: Individual Intention, GIF: Group Intention Forecasting.

Method

Method Overview
Method Overview. GIFT extracts bounding box, pose, gaze, headpose, velocity, and role features from the τ observed frames (τ ∈ {1, 2, ..., T}). The STGCN Encoder models spatial and temporal patterns. The STGCN Decoder forecasts future features, from which the player holding the shooting role is identified to determine the frame at which the shot occurs.
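The figure above summarizes the data flow in GIFT. The following is a minimal structural sketch of that flow, not the authors' implementation: the spatio-temporal graph convolutions are replaced with plain GRUs, and all dimensions, head designs, and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GIFTSketch(nn.Module):
    """Illustrative skeleton of the GIFT flow (not the released model):
    per-player features -> spatio-temporal encoder -> decoder that forecasts
    future features -> heads that score which player is the shooter and
    when the shot frame occurs."""

    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        # Stand-ins for the STGCN encoder/decoder; the real GIFT models the
        # player-interaction graph, which a plain GRU does not.
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.role_head = nn.Linear(hidden, 1)   # shooter score per player
        self.time_head = nn.Linear(hidden, 1)   # predicted offset to the shot frame

    def forward(self, x):
        # x: (batch, players, seen_frames, feat_dim), built from the bbox/pose/
        # gaze/headpose/velocity/role features of the tau observed frames.
        b, p, t, d = x.shape
        enc_out, h = self.encoder(x.reshape(b * p, t, d))
        dec_out, _ = self.decoder(enc_out[:, -1:].repeat(1, t, 1), h)
        per_player = dec_out[:, -1].reshape(b, p, -1)
        role_logits = self.role_head(per_player).squeeze(-1)   # (batch, players)
        shot_offset = self.time_head(per_player).squeeze(-1)   # (batch, players)
        # Read the forecast time from the player identified as the shooter.
        shooter = role_logits.argmax(dim=1)
        return role_logits, shot_offset.gather(1, shooter.unsqueeze(1)).squeeze(1)
```

A call would pass a tensor of shape (batch, 10, τ, feat_dim) assembled from the per-player annotations; the two outputs correspond to the shooter identification and the forecast shot time.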

Results

Experimental Results
Experimental Results. Quantitative comparison of leading methods on SHOT. Best performances are highlighted in bold.

Supplementary Materials

Overview

This supplementary material is organized as follows:

  • Dataset Construction Details
    1. Video Clip Selection
    2. View Categorization
    3. Tactic Categorization
  • Tactic Categorization Details
    1. Hierarchical tactic categorization structure of SHOT dataset
  • Dataset Statistics
    1. Number of videos for each NBA team playing at home
    2. Number of video clips per tactical categorization
    3. Frequency of tactics used by home NBA teams
  • Additional Video Examples of SHOT

Dataset Construction

Video Clip Selection

We use the open-source software LosslessCut to extract the desired shooting clips from the full-game videos. We ensure that: (1) each clip contains only a single offensive shooting attempt, regardless of its success; and (2) all 10 players appear in the clip, so that the subsequent feature annotations cover every player.
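These two rules can be expressed as a simple sanity check; the field names below follow the hypothetical annotation sketch given earlier, not the released schema:

```python
def clip_is_valid(annotation: dict) -> bool:
    """Check the two selection rules: exactly one shooting attempt is marked,
    and all 10 players are annotated. Field names are hypothetical."""
    single_attempt = isinstance(annotation.get("shot_frame"), int)
    all_players_present = len(annotation.get("players", [])) == 10
    return single_attempt and all_players_present
```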

Video Clip Selection Process
Illustration of the video clip selection process using LosslessCut. Shooting clips are manually trimmed by identifying their start and end time points, then exported for further processing.

View Categorization

Video clips are classified into 5 distinct views based on camera angle, with each view spanning 30°, except View3, the central frontal view, which covers 60°.
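Assuming the camera azimuth across the court is measured from 0° to 180° (a convention introduced here only for illustration), the binning implied by the description above could look like:

```python
def categorize_view(angle_deg: float) -> int:
    """Map an assumed 0-180 degree camera angle to a view index.
    Views 1, 2, 4, 5 span 30 degrees each; View 3 is the central
    frontal view spanning 60 degrees."""
    bins = [(0, 30, 1), (30, 60, 2), (60, 120, 3), (120, 150, 4), (150, 180, 5)]
    for lo, hi, view in bins:
        if lo <= angle_deg < hi or (view == 5 and angle_deg == 180):
            return view
    raise ValueError(f"angle {angle_deg} is outside the expected 0-180 degree range")
```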

View Categorization
Illustration of View1--View5. Views 1, 2, 4, and 5 each span 30°, while View 3 covers the central 60°.

Tactic Categorization

We create a comprehensive classification of 54 tactics, capturing tactical performance from 5 different camera views. These tactics describe complex individual interactions and tactical coordination in basketball games along four progressive dimensions: pass frequency and pick&roll, which capture interactions between group members; drive, which indicates shooting intention; and shot type, which highlights the individual features relevant to the shot.

Tactic Categorization Process
Illustration of the tactic categorization process. Keywords observed in the video are selected and confirmed for saving.

Tactic Categorization Details

Hierarchical Structure of SHOT Dataset

Each shooting video in SHOT is categorized by four tactical dimensions: Pass Frequency (No-Pass, One-Pass, Multi-Pass), Pick&Roll (P&R) Frequency (No-P&R, One-P&R, Multi-P&R), Drive Presence (Drive, No-Drive), and Shot Type (Shoot, Layup, Dunk). Their combinations define 54 distinct tactical scenes (3 × 3 × 2 × 3), capturing diverse basketball strategies.
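As a quick sanity check of the count, the four dimensions above can be enumerated directly; the labels are taken verbatim from the list above:

```python
from itertools import product

# The four tactical dimensions; their Cartesian product yields the
# 54 tactical scenes (3 x 3 x 2 x 3).
pass_freq = ["No-Pass", "One-Pass", "Multi-Pass"]
pnr_freq  = ["No-P&R", "One-P&R", "Multi-P&R"]
drive     = ["Drive", "No-Drive"]
shot_type = ["Shoot", "Layup", "Dunk"]

tactics = list(product(pass_freq, pnr_freq, drive, shot_type))
assert len(tactics) == 54
print(tactics[0])   # ('No-Pass', 'No-P&R', 'Drive', 'Shoot')
```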

Dataset Statistics

We count clips by home team to support further analysis of team-specific tactical characteristics. The analysis reveals interesting patterns in tactical preferences across different NBA divisions.

NBA Teams Distribution

NBA Teams Distribution
Number of videos for home NBA teams. Colors indicate different NBA divisions, with each box labeled by the team's abbreviation.

Tactical Statistics

Full Tactical Statistics
Number of video clips per tactical categorization. "1" and "M" denote "One" and "Multi", respectively; "P" and "D" represent "Pass" and "Drive"; while "ST", "LY", and "DK" correspond to "Shoot", "Layup", and "Dunk".

Team Tactical Preferences

Relating team information to tactical data reveals interesting patterns. For example, the Southeast Division favors the Multi-P&R tactic more than the Southwest Division does, which supports our dataset's emphasis on tactical information for predicting shot intention.
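Assuming the per-clip annotations expose the home team, its division, and the pick&roll label (the column names below are illustrative assumptions), this kind of division-level preference can be tabulated with a short pandas sketch:

```python
import pandas as pd

# Illustrative rows; in practice these would be read from the SHOT annotations.
clips = pd.DataFrame([
    {"home_team": "MIA", "division": "Southeast", "pick_and_roll": "Multi-P&R"},
    {"home_team": "DAL", "division": "Southwest", "pick_and_roll": "One-P&R"},
])

# Share of Multi-P&R clips within each division.
multi_pnr_share = (
    clips.assign(is_multi=clips["pick_and_roll"].eq("Multi-P&R"))
         .groupby("division")["is_multi"]
         .mean()
)
print(multi_pnr_share)
```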

Additional Video Examples

The examples are selected from various camera views and tactic combinations. Each video clip is represented by multiple frames, illustrating the shooting progression and demonstrating the diversity of scenarios captured in the SHOT dataset.

Video Examples from SHOT Dataset
Additional video examples from SHOT dataset. Examples are selected from various camera views and tactic combinations. Each video clip is represented by multiple frames, illustrating the shooting progression.