SkySense-O

[![paper](https://img.shields.io/badge/Paper-CVPR_2025-ECA8A7?logo=readthedocs&logoColor=white)](https://openaccess.thecvf.com/content/CVPR2025/papers/Zhu_SkySense-O_Towards_Open-World_Remote_Sensing_Interpretation_with_Vision-Centric_Visual-Language_Modeling_CVPR_2025_paper.pdf) [![Dataset](https://img.shields.io/badge/Dataset-Hugging_Face-CFAFD4?logo=alfred&logoColor=white)](https://huggingface.co/zqcraft/SkySense-O/tree/main) [![Weight](https://img.shields.io/badge/Weight-Hugging_Face-C1E1C1?logo=pkgsrc&logoColor=white)](https://huggingface.co/zqcraft/SkySense-O/tree/main) [![Demo](https://img.shields.io/badge/Try_Demo-Docs-6B88E3?logo=applearcade&logoColor=white)](demo/readme.md)

Introduction✨

1. SkySense Family

Welcome to SkySense-O, part of the SkySense family [homepage], a series of remote sensing foundation models for Earth observation listed below. We'd be delighted to have your attention, and a star would be much appreciated!
(1) SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery
(2) SkySense-O: Towards Open-World Remote Sensing Interpretation with Vision-Centric Visual-Language Modeling
(3) SkySense-V2: A Unified Foundation Model for Multi-modal Remote Sensing
(4) SkySense++: A Semantic-Enhanced Multi-Modal Remote Sensing Foundation Model for Earth Observation
(5) SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding

2. SkySense-O

This is a version of SkySense that aggregates CLIP and SAM for remote sensing interpretation, as described in SkySense-O: Towards Open-World Remote Sensing Interpretation with Vision-Centric Visual-Language Modeling. In addition to introducing a powerful remote sensing vision-language foundation model, we also propose the first open-vocabulary segmentation dataset in the remote sensing domain. Each ground-truth annotation (a mask paired with text) has undergone multiple rounds of annotation and validation by human experts, enabling the model to segment anything in open remote sensing scenarios.

Compared with SAM and GroundingDINO, the primary advantage of our model lies in its ability to deliver denser pixel-level spatial output together with broader semantic labeling, as shown below.

News 🚀

Try Our Demo 🕹️

  1. Install dependencies.
  2. Download the demo checkpoint [ckpt] (a download sketch is shown after this list).
  3. Run the demo according to the demo guide. [docs]
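For reference, here is a minimal download sketch using the Hugging Face CLI. The repository id comes from the badges above, while the local target directory `./ckpt` is only an assumption; follow the demo guide [docs] for the actual entry point and arguments.

```bash
# Sketch: fetch the demo checkpoint from the Hugging Face Hub.
# Assumes huggingface_hub is installed; ./ckpt is a placeholder directory.
pip install -U huggingface_hub
huggingface-cli download zqcraft/SkySense-O --local-dir ./ckpt
# Then launch the demo as described in demo/readme.md.
```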

Dependencies and Installation

1. Install detectron2:

        python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'

2. Clone this repository and install the remaining dependencies:

        git clone https://github.com/zqcraft/SkySense-O.git
        cd SkySense-O
        pip install -r require.txt
        pip install accelerate -U
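As a quick sanity check after installation (a sketch, not part of the official instructions), the following one-liner verifies that PyTorch and detectron2 import correctly and whether CUDA is available:

```bash
# Verify that the core dependencies import and report their versions.
python -c "import torch, detectron2; print(torch.__version__, detectron2.__version__, torch.cuda.is_available())"
```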

Dataset Preparation

After downloading the Sky-SA dataset, organize the data under ./data as follows:

    ├── Sky-SA
    │   ├── img_dir
    │   ├── ann_dir
    │   ├── skysa_dataset.jsonl
    │   └── skysa_graph.jsonl
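A small sketch to sanity-check the layout; it only counts images and JSONL records under the paths above and does not assume any particular annotation schema:

```bash
# Confirm the dataset is in place by counting images, annotations, and records.
ls data/Sky-SA/img_dir | wc -l
ls data/Sky-SA/ann_dir | wc -l
wc -l data/Sky-SA/skysa_dataset.jsonl data/Sky-SA/skysa_graph.jsonl
```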

Model Training and Evaluation

To train the model, run:

    sh run_train.sh

To run evaluation only, add `--eval-only` to the command in run_train.sh so that the line reads `python train_net.py --eval-only ...`, then execute the script again.
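For reference, a sketch of an evaluation-only invocation in the usual Detectron2 style is shown below; the config path and the MODEL.WEIGHTS override are assumptions, so check run_train.sh for the exact arguments used in this repository.

```bash
# Evaluation only; the config file and checkpoint path are placeholders.
python train_net.py --eval-only \
    --config-file configs/your_config.yaml \
    MODEL.WEIGHTS path/to/checkpoint.pth
```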

Results

Citation

@InProceedings{Zhu_2025_CVPR,
    author    = {Zhu, Qi and Lao, Jiangwei and Ji, Deyi and Luo, Junwei and Wu, Kang and Zhang, Yingying and Ru, Lixiang and Wang, Jian and Chen, Jingdong and Yang, Ming and Liu, Dong and Zhao, Feng},
    title     = {SkySense-O: Towards Open-World Remote Sensing Interpretation with Vision-Centric Visual-Language Modeling},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {14733-14744}
}

@article{wu2025semantic,
  author       = {Wu, Kang and Zhang, Yingying and Ru, Lixiang and Dang, Bo and Lao, Jiangwei and Yu, Lei and Luo, Junwei and Zhu, Zifan and Sun, Yue and Zhang, Jiahao and Zhu, Qi and Wang, Jian and Yang, Ming and Chen, Jingdong and Zhang, Yongjun and Li, Yansheng},
  title        = {A semantic-enhanced multi-modal remote sensing foundation model for Earth observation},
  journal      = {Nature Machine Intelligence},
  year         = {2025},
  doi          = {10.1038/s42256-025-01078-8},
  url          = {https://doi.org/10.1038/s42256-025-01078-8}
}

@inproceedings{guo2024skysense,
    author    = {Guo, Xin and Lao, Jiangwei and Dang, Bo and Zhang, Yingying and Yu, Lei and Ru, Lixiang and Zhong, Liheng and Huang, Ziyuan and Wu, Kang and Hu, Dingxiang and He, Huimei and Wang, Jian and Chen, Jingdong and Yang, Ming and Zhang, Yongjun and Li, Yansheng},
    title     = {SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {27672-27683}
}

@article{luo2024skysensegpt,
  title={SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding},
  author={Luo, Junwei and Pang, Zhen and Zhang, Yongjun and Wang, Tingzhu and Wang, Linlin and Dang, Bo and Lao, Jiangwei and Wang, Jian and Chen, Jingdong and Tan, Yihua and others},
  journal={arXiv preprint arXiv:2406.10100},
  year={2024}
}

Acknowledgement

This implementation is based on Detectron2. Thanks for the awesome work.