Welcome to SkySense-O, part of the SkySense family [homepage], a series of remote sensing foundation models for Earth observation listed below. We'd be delighted to have your attention and would appreciate a star!
(1) SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery
(2) SkySense-O: Towards Open-World Remote Sensing Interpretation with Vision-Centric Visual-Language Modeling
(3) SkySense-V2: A Unified Foundation Model for Multi-modal Remote Sensing
(4) SkySense++: A Semantic-Enhanced Multi-Modal Remote Sensing Foundation Model for Earth Observation
(5) SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding
This repository provides a CLIP- and SAM-aggregated version of SkySense for remote sensing interpretation, as described in SkySense-O: Towards Open-World Remote Sensing Interpretation with Vision-Centric Visual-Language Modeling. In addition to introducing a powerful remote sensing vision-language foundation model, we also propose the first open-vocabulary segmentation dataset in the remote sensing domain. Each ground-truth annotation in the dataset (a mask paired with text) has undergone multiple rounds of annotation and validation by human experts, enabling the model to segment anything in open remote sensing scenarios.

Compared to SAM and GroundingDINO, the primary advantage of our model lies in its ability to deliver spatially dense, pixel-level output with broader semantic labeling.
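For a rough sense of how such a model might be queried, the following is a minimal, hypothetical inference sketch based on a plain Detectron2 workflow. The config path, checkpoint path, and the way open-vocabulary class prompts are registered are assumptions, not the repository's actual API; the released demo and train_net.py are the authoritative entry points.

```python
# Hypothetical inference sketch (generic Detectron2 workflow, not the official demo API).
# The config/checkpoint paths and the prompt registration below are assumptions.
import cv2
from detectron2.config import get_cfg
from detectron2.data import MetadataCatalog
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file("configs/skysense_o.yaml")    # hypothetical config file
cfg.MODEL.WEIGHTS = "checkpoints/skysense_o.pth"  # hypothetical checkpoint path
predictor = DefaultPredictor(cfg)

# Open-vocabulary text prompts; registering free-form class names this way is an assumption.
MetadataCatalog.get("skysense_o_demo").thing_classes = [
    "airport runway", "solar panel", "greenhouse",
]

image = cv2.imread("example_tile.png")  # any remote sensing image (BGR)
outputs = predictor(image)              # dense, pixel-level predictions
print(outputs)
```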
2025/02/27: 🔥 SkySense-O has been accepted to CVPR 2025!
2025/04/08: 🔥 We introduce SkySense-O, demonstrating impressive zero-shot capabilities in a thorough evaluation spanning 14 datasets, from recognition to reasoning and from classification to localization. Specifically, it outperforms the latest models such as SegEarth-OV, GeoRSCLIP, and VHM by a large margin, i.e., 11.95%, 8.04%, and 3.55% on average, respectively.
2025/06/10: 🔥 We release the training and evaluation code.
2025/06/11: 🔥 We release the checkpoints and demo. Welcome to try them!
2025/06/17: 🔥 We release the checkpoints of SkySense-CLIP [ckpt] for future research.
2025/06/29: 🔥 We release the Sky-SA dataset [dataset].
2025/08/06: 🔥 Our new work, SkySense++ [paper][code], has been accepted to Nature Machine Intelligence! Unlike SkySense-O, which uses text prompts, SkySense++ focuses on visual prompts.
2025/08/08: 🔥 The SkySense family homepage is now live. Welcome to follow us!

python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'
git clone https://github.com/zqcraft/SkySense-O.git
cd SkySense-O
pip install -r require.txt
pip install accelerate -U
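Optionally, you can sanity-check the environment before moving on. This is a small, optional snippet and not part of the official setup.

```python
# Optional sanity check: confirm PyTorch, CUDA, and Detectron2 are importable.
import torch
import detectron2

print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
print("detectron2:", detectron2.__version__)
```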
After downloading the Sky-SA dataset, organize the data in ./data as follows:
├── Sky-SA
│   ├── img_dir
│   ├── ann_dir
│   ├── skysa_dataset.jsonl
│   ├── skysa_graph.jsonl
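If you want to peek at the annotation files, the snippet below is a minimal sketch that only assumes standard JSON Lines formatting; the concrete field names depend on the released dataset and are not documented here.

```python
# Minimal sketch for inspecting Sky-SA annotations; assumes JSON Lines format only.
# The concrete fields (e.g., mask path, text description) depend on the release.
import json

with open("data/Sky-SA/skysa_dataset.jsonl", "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(sorted(record.keys()))  # list the available fields
        if i >= 2:                    # only look at the first few records
            break
```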
To start training, run:
sh run_train.sh
To run evaluation only, add --eval-only to the command in run_train.sh so that the line reads python train_net.py --eval-only, then execute the same command as above (sh run_train.sh).
@InProceedings{Zhu_2025_CVPR,
author = {Zhu, Qi and Lao, Jiangwei and Ji, Deyi and Luo, Junwei and Wu, Kang and Zhang, Yingying and Ru, Lixiang and Wang, Jian and Chen, Jingdong and Yang, Ming and Liu, Dong and Zhao, Feng},
title = {SkySense-O: Towards Open-World Remote Sensing Interpretation with Vision-Centric Visual-Language Modeling},
booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
month = {June},
year = {2025},
pages = {14733-14744}
}
@article{wu2025semantic,
author = {Wu, Kang and Zhang, Yingying and Ru, Lixiang and Dang, Bo and Lao, Jiangwei and Yu, Lei and Luo, Junwei and Zhu, Zifan and Sun, Yue and Zhang, Jiahao and Zhu, Qi and Wang, Jian and Yang, Ming and Chen, Jingdong and Zhang, Yongjun and Li, Yansheng},
title = {A semantic-enhanced multi-modal remote sensing foundation model for Earth observation},
journal = {Nature Machine Intelligence},
year = {2025},
doi = {10.1038/s42256-025-01078-8},
url = {https://doi.org/10.1038/s42256-025-01078-8}
}
@inproceedings{guo2024skysense,
author = {Guo, Xin and Lao, Jiangwei and Dang, Bo and Zhang, Yingying and Yu, Lei and Ru, Lixiang and Zhong, Liheng and Huang, Ziyuan and Wu, Kang and Hu, Dingxiang and He, Huimei and Wang, Jian and Chen, Jingdong and Yang, Ming and Zhang, Yongjun and Li, Yansheng},
title = {SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {27672-27683}
}
@article{luo2024skysensegpt,
title={SkySenseGPT: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding},
author={Luo, Junwei and Pang, Zhen and Zhang, Yongjun and Wang, Tingzhu and Wang, Linlin and Dang, Bo and Lao, Jiangwei and Wang, Jian and Chen, Jingdong and Tan, Yihua and others},
journal={arXiv preprint arXiv:2406.10100},
year={2024}
}
This implementation is based on Detectron2. Thanks for the awesome work.