# Domain Annotation Pipeline: BFVD - Release v0.2

This repository contains the **second public data release** of the [UCL Orengo Group domain-annotation-pipeline](https://github.com/UCLOrengoGroup/domain-annotation-pipeline).

The pipeline integrates predicted protein structures with domain boundary detection, structural annotations, and functional classification, producing a domain-level dataset derived from viral proteomes.

---

## Changelog

- **v0.2 - Second data release (Dec 2025)**
  - Updated pipeline version and domain assignments
  - Expanded structural and taxonomic annotations
  - Revised quality metrics (e.g., `domqual`, `dom_single_domain`)
  - Minor schema updates to match the current Nextflow workflow
- **v0.1 - Draft results (Sep 2025)**

---

## Overview of the Pipeline

1. **Input structures**

   Viral protein structural models from **AlphaFold** (UniProt / BFVD), supplied as a zip file.

2. **Domain boundary prediction**

   Three independent domain boundary prediction methods are applied. A **consensus domain definition** is inferred from agreement between the predictors.

3. **Domain extraction**

   Consensus domains are **chopped** from the full AlphaFold models and stored as individual PDB files.

4. **Domain annotation**

   Each chopped domain is annotated with structural and functional properties, including:

   - Secondary structure composition
   - Packing density and radius of gyration
   - AlphaFold pLDDT–derived quality information
   - Fold classification (via **Foldseek** and **CATH**)
   - Domain quality metrics and consensus confidence
   - Viral taxonomic metadata

---

## Example Results File (v0.2)

Results are provided as a **tab-separated values (`.tsv`)** file. Each row corresponds to one **consensus structural domain** derived from a viral protein.

### Columns

| Column | Description |
|--------|-------------|
| **uniprot_id** | UniProt identifier with domain index suffix (e.g. `B5BTU2_01`) |
| **md5_domain** | MD5 checksum of the chopped domain PDB (unique domain identifier) |
| **consensus_level** | Confidence in consensus domain boundary (`high`, `med`, etc.) |
| **chopping** | Domain residue range(s), including multi-segment definitions |
| **nres_domain** | Number of residues in the chopped domain |
| **num_segments** | Number of continuous segments in the domain |
| **num_helix_strand_turn** | Total number of secondary structure elements (helices + strands + turns) |
| **num_helix** | Number of α-helices |
| **num_strand** | Number of β-strands |
| **num_helix_strand** | Count of helices + strands |
| **num_turn** | Number of turns |
| **packing_density** | Measure of structural compactness |
| **normed_radius_gyration** | Domain radius of gyration normalized by length |
| **avg_plddt** | Average pLDDT score for the domain |
| **proteome_id** | Identifier for the source viral proteome |
| **tax_common_name** | Common name of the source species/virus |
| **tax_scientific_name** | Scientific name of the source species/virus |
| **tax_lineage** | Full taxonomic lineage string |
| **domqual** | Composite domain quality metric (0–1) |
| **dom_single_domain** | Whether the “Dom” boundary method classifies the protein as single-domain (`True`/`False`) |
| **foldseek_match_id** | Best Foldseek match identifier |
| **foldseek_evalue** | E-value for the Foldseek match |
| **foldseek_tmscore** | TM-score for structural similarity |
| **cath_label** | Assigned CATH classification (if available) |
| **foldseek_match_type** | `H` = homologous, `T` = topological, `N` = no confident match |
| **foldseek_query_cov** | Fraction of the query domain covered by the Foldseek match |
| **foldseek_target_cov** | Fraction of the target covered by the Foldseek match |
| **Q_score** | Domain quality metric derived from structural and consensus features |

---

## Summary of Changes in v0.2

- Updated domain definitions produced by the newest version of the pipeline
- Additional taxonomic fields in the output file
- Expanded quality metrics (`domqual`, `dom_single_domain`)
- Harmonised structural annotation fields across domain fragments
- Revised Foldseek and CATH mapping outputs
- Improved overall consistency and completeness of the TSV schema

---
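
As a rough illustration of how the results file might be consumed, the sketch below reads a TSV with the standard library, keeps rows by `consensus_level` and `avg_plddt`, and splits the `chopping` string into residue ranges. The column names come from the table above; everything else is an assumption made for the example. In particular, the `start-end` segments joined by `_` are a guess at the `chopping` format, the thresholds are arbitrary, the helper names `parse_chopping` and `load_domains` are hypothetical, and the rows in `EXAMPLE_TSV` are invented rather than taken from the release.

```python
import csv
import io


def parse_chopping(chopping):
    """Split a chopping string into (start, end) residue ranges.

    ASSUMPTION: segments look like 'start-end' and are joined by '_',
    e.g. '95-140_160-200'. Check the real file before relying on this.
    """
    segments = []
    for part in chopping.split("_"):
        start, end = part.split("-")
        segments.append((int(start), int(end)))
    return segments


def load_domains(handle, min_plddt=70.0, levels=("high",)):
    """Return rows whose consensus_level is in `levels` and whose
    avg_plddt is at least `min_plddt` (both thresholds are illustrative)."""
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["consensus_level"] not in levels:
            continue
        if float(row["avg_plddt"]) < min_plddt:
            continue
        # Attach the parsed residue ranges alongside the raw string.
        row["segments"] = parse_chopping(row["chopping"])
        kept.append(row)
    return kept


# Invented rows for illustration only; not real entries from the release.
EXAMPLE_TSV = (
    "uniprot_id\tconsensus_level\tchopping\tnres_domain\tavg_plddt\n"
    "EXAMPLE1_01\thigh\t5-90\t86\t88.5\n"
    "EXAMPLE1_02\tmed\t95-140_160-200\t87\t91.0\n"
    "EXAMPLE2_01\thigh\t1-40_60-100\t81\t62.3\n"
)

for domain in load_domains(io.StringIO(EXAMPLE_TSV)):
    print(domain["uniprot_id"], domain["segments"])
```

To use this on the real file, an open file handle (for example `open(path, newline="")`) can be passed in place of the `io.StringIO` wrapper. A quick sanity check worth running first is whether the summed segment lengths match `nres_domain`, which would confirm or refute the assumed `chopping` format.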