
Masked Self-Attention Fusion Network for Joint Classification of Hyperspectral and LiDAR Data

Lulu Shi
Chunchao Li
Zhengchao Zeng
Puhong Duan
Behnood Rasti
Antonio Plaza

January 12, 2026

Hyperspectral imaging (HSI) captures abundant spectral information about land covers, while light detection and ranging (LiDAR) provides elevation and structural characteristics. Joint classification of HSI and LiDAR data can effectively merge spectral and elevation information to enhance land cover classification. Current joint HSI and LiDAR classification approaches mainly employ a three-layer deep network to extract high-order features, followed by a concatenation or weighted fusion scheme, which cannot fully exploit the unique properties of the different data modalities. Moreover, these methods usually require substantial computational resources. To alleviate these issues, this paper proposes a masked self-attention fusion network (MSAF) for joint HSI and LiDAR classification, in which a cascaded cross-attention fusion framework is designed to fully merge features from different stages. First, a mobile convolution block is developed to extract multi-modal features. Then, a multi-view sequence embedding method is proposed to effectively integrate elevation information with spectral-spatial information, yielding token sequences. Finally, an effective masked self-attention mechanism is designed to fuse the token sequences. Experimental results on multiple datasets indicate that the proposed framework significantly outperforms other advanced multi-modal fusion methods in terms of classification performance and computational efficiency. The code for this manuscript is available at github.com/lulushh/MSAF.
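The abstract's final step, fusing HSI and LiDAR token sequences with masked self-attention, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the token shapes and the particular mask pattern (within-modality attention plus pairwise cross-modality links) are assumptions chosen only to show how a boolean mask restricts which tokens attend to which.

```python
import numpy as np

def masked_self_attention(tokens, mask):
    """Single-head self-attention where mask[i, j] = True means
    token i is allowed to attend to token j (illustrative sketch)."""
    d = tokens.shape[1]
    scores = tokens @ tokens.T / np.sqrt(d)          # scaled dot-product scores
    scores = np.where(mask, scores, -1e9)            # block disallowed pairs
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ tokens

# Toy token sequences for the two modalities (hypothetical sizes).
rng = np.random.default_rng(0)
hsi_tokens = rng.standard_normal((4, 8))    # spectral-spatial tokens
lidar_tokens = rng.standard_normal((4, 8))  # elevation tokens
tokens = np.concatenate([hsi_tokens, lidar_tokens], axis=0)

# Example mask: each token attends within its own modality and to
# the spatially corresponding token of the other modality.
n = tokens.shape[0]
mask = np.zeros((n, n), dtype=bool)
mask[:4, :4] = True
mask[4:, 4:] = True
for i in range(4):
    mask[i, 4 + i] = True
    mask[4 + i, i] = True

fused = masked_self_attention(tokens, mask)
print(fused.shape)  # fused sequence keeps the token layout: (8, 8)
```

A sanity check on the masking: with an identity mask, each token can attend only to itself, so the softmax weight collapses to 1 and the output equals the input sequence.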
