AFD-INSTRUCTION: A Comprehensive Antibody Instruction Dataset with Functional Annotations for LLM-Based Understanding and Design

Ling Luo^♠, Wenbin Jiang^♠, Hongyuan Chang, Xinkang Wang, Yueting Xiong^♥, Mengsha Tong^♥, Rongshan Yu^♥

Xiamen University
ICLR 2026
^♠ Indicates equal contribution

Abstract

Large language models (LLMs) have significantly advanced protein representation learning. However, their capacity to interpret and design antibodies through natural language remains limited. To address this challenge, we present AFD-Instruction, the first large-scale instruction dataset with functional annotations tailored to antibodies. This dataset encompasses two key components: antibody understanding, which infers functional attributes directly from sequences, and antibody design, which enables de novo sequence generation under functional constraints. These components provide explicit sequence-function alignment and support antibody design guided by natural language instructions. Extensive instruction-tuning experiments on general-purpose LLMs demonstrate that AFD-Instruction consistently improves performance across diverse antibody-related tasks. By linking antibody sequences with textual descriptions of function, AFD-Instruction establishes a new foundation for advancing antibody modeling and accelerating therapeutic discovery.

Dataset Construction

AFD-Instruction is constructed through a multi-agent pipeline that collects antibody data from PDB/SAbDab and PubMed, extracts functional annotations via coordinated agents (Mr. Extractor, Dr. Mechanism, and Prof. Function), and enables two downstream tasks: antibody sequence understanding and function-guided antibody design.
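To make the two downstream tasks concrete, the following is a minimal sketch of how an instruction record for each task might be laid out. The field names (`task`, `instruction`, `input`, `output`), the helper functions, and the toy sequence fragment are all assumptions for illustration; the released dataset may use a different schema.

```python
import json
from dataclasses import dataclass, asdict


# Hypothetical record layout; the released dataset may use different fields.
@dataclass
class InstructionRecord:
    task: str         # "understanding" or "design"
    instruction: str  # natural-language request
    input: str        # a sequence (understanding) or functional constraints (design)
    output: str       # a functional answer (understanding) or a sequence (design)


def understanding_record(sequence: str, question: str, answer: str) -> InstructionRecord:
    """Sequence in, functional description out."""
    return InstructionRecord("understanding", question, sequence, answer)


def design_record(constraints: str, sequence: str) -> InstructionRecord:
    """Functional constraints in, sequence out."""
    instruction = "Design an antibody sequence that satisfies the functional constraints."
    return InstructionRecord("design", instruction, constraints, sequence)


TOY_SEQ = "EVQLVESGG"  # placeholder fragment, not an entry from the dataset

records = [
    understanding_record(
        TOY_SEQ,
        "What is the likely mechanism of action of this antibody?",
        "<functional annotation text>",
    ),
    design_record("<antigen, mechanism, and epitope description>", TOY_SEQ),
]

# One JSON object per line, a common format for instruction tuning.
jsonl = "\n".join(json.dumps(asdict(r)) for r in records)
```

The two record types mirror each other: the same sequence-function pair can, in principle, be read in either direction.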

Data Distribution

Dataset overview: (a) rose plot of instruction counts; (b) sequence length distributions and word analysis. The heavy-chain variable regions display a near-unimodal distribution centered around 110–125 amino acids, while light-chain variable regions are generally shorter. Antigen sequences display a broader, multimodal distribution. The instruction text is dominated by mechanistic and structural terms and target-focused nouns, spanning molecular to phenotypic descriptors.

Case Study

Case studies of instruction-sensitive CDRH3 design. For a fixed antibody scaffold, we vary only the natural language specification of antigen, mechanism, and epitope, and observe prompt-dependent changes in the designed CDRH3 sequences and their structural/biophysical scores. These differences indicate that the model adapts its designs to the functional semantics expressed in the instructions rather than producing generic CDRH3 loops.
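The setup described above (a fixed scaffold, with only the functional specification varied) can be sketched as follows. The prompt template, the `[CDRH3]` placeholder token, the function name, and the toy scaffold and specifications are hypothetical; they are not the prompts used in the paper.

```python
MASK = "[CDRH3]"  # hypothetical placeholder marking the loop to be designed


def build_design_prompt(scaffold: str, antigen: str, mechanism: str, epitope: str) -> str:
    """Combine a fixed scaffold with a natural-language functional specification."""
    if scaffold.count(MASK) != 1:
        raise ValueError(f"scaffold must contain exactly one {MASK} placeholder")
    return (
        "Design the CDRH3 loop for the antibody scaffold below.\n"
        f"Antigen: {antigen}\n"
        f"Mechanism: {mechanism}\n"
        f"Epitope: {epitope}\n"
        f"Scaffold: {scaffold}"
    )


# Toy scaffold fragment, not a real antibody.
scaffold = f"EVQLVESGG{MASK}WGQGTLV"

# Only the specification changes between prompts; the scaffold stays fixed.
specs = [
    {"antigen": "antigen A", "mechanism": "blocks receptor binding", "epitope": "epitope 1"},
    {"antigen": "antigen B", "mechanism": "neutralization", "epitope": "epitope 2"},
]
prompts = [build_design_prompt(scaffold, **spec) for spec in specs]
```

Because the scaffold line is identical across prompts, any difference in the designed loops can be attributed to the specification text.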

Results

Antibody Sequence Understanding - Classification

Performance comparison on antibody sequence understanding tasks (classification). QwenAB and LLaMAB achieve state-of-the-art results across all subtasks: class prediction, disease association, binding prediction, mechanism inference, and functional annotation.

Antibody Sequence Understanding - Caption

Performance on caption tasks requiring free-form textual answers. AFD-Instruction-tuned models demonstrate superior performance across all evaluated metrics (BLEU, ROUGE, METEOR) compared to baseline models.
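As an illustration of how overlap-based caption metrics of this kind work, here is a minimal pure-Python sketch of a ROUGE-L style score using whitespace tokens and a balanced F1. It is a simplification for intuition only; the tokenization, the F-measure weighting, and the function names are assumptions, and it is not the evaluation code behind the reported numbers.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Balanced F1 over LCS-based precision and recall (simplified ROUGE-L)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A candidate that reproduces the reference in order scores 1.0, while one sharing no tokens with it scores 0.0; partial, in-order overlap falls in between.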

Function-Guided Antibody Design

Evaluation of antibody design quality. AFD-Instruction-tuned models achieve higher pTM, ipTM, and pLDDT scores, indicating enhanced stability, improved inter-chain packing accuracy, and increased confidence in predicted 3D conformations of designed antibody structures.
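A comparison like the one above typically averages per-design confidence scores for each model. The sketch below shows that aggregation under stated assumptions: the `StructureScores` container, the `summarize` helper, and the numeric values are placeholders for illustration, not results from the paper, and pLDDT is assumed to be on a 0–100 scale.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class StructureScores:
    ptm: float    # predicted TM-score, global fold confidence (0 to 1)
    iptm: float   # interface pTM, confidence in inter-chain placement (0 to 1)
    plddt: float  # mean per-residue confidence (assumed 0 to 100 here)


def summarize(scores: list[StructureScores]) -> dict[str, float]:
    """Average each confidence metric over a set of designed structures."""
    if not scores:
        raise ValueError("need at least one scored design")
    return {
        "pTM": mean([s.ptm for s in scores]),
        "ipTM": mean([s.iptm for s in scores]),
        "pLDDT": mean([s.plddt for s in scores]),
    }


# Placeholder values for two hypothetical designs, not reported results.
toy_designs = [
    StructureScores(ptm=0.5, iptm=0.25, plddt=80.0),
    StructureScores(ptm=0.75, iptm=0.5, plddt=90.0),
]
summary = summarize(toy_designs)
```

Higher averages on all three metrics correspond to the improvements described above: more confident global folds, more confident inter-chain arrangement, and more confident per-residue placement.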

BibTeX

@inproceedings{luo2026afd,
    title={AFD-INSTRUCTION: A Comprehensive Antibody Instruction Dataset with Functional Annotations for LLM-Based Understanding and Design},
    author={Ling Luo and Wenbin Jiang and Hongyuan Chang and Xinkang Wang and Yueting Xiong and Mengsha Tong and Rongshan Yu},
    booktitle={International Conference on Learning Representations (ICLR)},
    year={2026}
}