# Domain Annotation Pipeline: BFVD — Release v0.2

This repository contains the **second public data release** of the  
[UCL Orengo Group domain-annotation-pipeline](https://github.com/UCLOrengoGroup/domain-annotation-pipeline).

The pipeline integrates predicted protein structures with domain boundary detection, structural annotations, and functional classification, producing a domain-level dataset derived from viral proteomes.

---

## Changelog

- **v0.2 — Second data release (Dec 2025)**
  - Updated pipeline version and domain assignments  
  - Expanded structural and taxonomic annotations  
  - Revised quality metrics (e.g., `domqual`, `dom_single_domain`)  
  - Minor schema updates to match the current Nextflow workflow  

- **v0.1 — Draft results (Sep 2025)**

---

## Overview of the Pipeline

1. **Input structures**  
   Viral protein structural models from **AlphaFold** (UniProt / BFVD), supplied as a zip file.

2. **Domain boundary prediction**  
   Three independent domain boundary prediction methods are applied.  
   A **consensus domain definition** is inferred from agreement between predictors.

3. **Domain extraction**  
   Consensus domains are **chopped** from the full AlphaFold models and stored as individual PDB files.

4. **Domain annotation**  
   Each chopped domain is annotated with structural and functional properties, including:
   - Secondary structure composition  
   - Packing density and radius of gyration  
   - AlphaFold pLDDT–derived quality information  
   - Fold classification (via **Foldseek** and **CATH**)  
   - Domain quality metrics and consensus confidence  
   - Viral taxonomic metadata  

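As a rough illustration of the domain extraction step, the sketch below keeps the `ATOM` records whose residue number falls inside a set of residue ranges. It is not the pipeline's own chopping code: the function name `chop_pdb` and the `(start, end)` tuple input are assumptions made for this example, and it deliberately ignores chains, insertion codes, and `HETATM` records.

```python
def chop_pdb(pdb_text: str, segments: list[tuple[int, int]]) -> str:
    """Return a PDB string holding only ATOM records inside the given ranges.

    `segments` is a list of inclusive (start, end) residue ranges.
    Hypothetical helper for illustration; not part of the pipeline.
    """
    kept = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue  # skip headers, HETATM, TER and other records
        # PDB columns 23-26 (0-based slice 22:26) hold the residue number.
        resnum = int(line[22:26])
        if any(start <= resnum <= end for start, end in segments):
            kept.append(line)
    return "\n".join(kept + ["END"]) + "\n"
```
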
---

## Example Results File (v0.2)

Results are provided as a **tab-separated values (`.tsv`)** file.  
Each row corresponds to one **consensus structural domain** derived from a viral protein.

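A minimal sketch of reading the results file with Python's standard `csv` module and selecting well-supported domains. It assumes the file starts with a header row that uses the column names listed in this README. The function names, the pLDDT cutoff of 70, and the demo rows are illustrative assumptions, not values taken from the release. For a real file, pass a handle opened with `open(path, newline="")` instead of the in-memory demo.

```python
import csv
import io


def load_domains(handle):
    """Read the results TSV into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def select_confident(rows, min_plddt=70.0, level="high"):
    """Keep rows at the given consensus_level whose avg_plddt meets the cutoff.

    The default cutoff is an arbitrary example, not a pipeline setting.
    """
    return [
        row
        for row in rows
        if row["consensus_level"] == level and float(row["avg_plddt"]) >= min_plddt
    ]


# Synthetic rows for illustration only; these are not records from the release.
demo = io.StringIO(
    "uniprot_id\tconsensus_level\tavg_plddt\n"
    "EXAMPLE1_01\thigh\t91.2\n"
    "EXAMPLE1_02\tmed\t88.0\n"
    "EXAMPLE2_01\thigh\t54.3\n"
)
rows = load_domains(demo)
confident = select_confident(rows)
print([row["uniprot_id"] for row in confident])  # ['EXAMPLE1_01']
```
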
### Columns

| Column | Description |
|--------|-------------|
| **uniprot_id** | UniProt identifier with domain index suffix (e.g. `B5BTU2_01`) |
| **md5_domain** | MD5 checksum of the chopped domain PDB (unique domain identifier) |
| **consensus_level** | Confidence in consensus domain boundary (`high`, `med`, etc.) |
| **chopping** | Domain residue range(s), including multi-segment definitions |
| **nres_domain** | Number of residues in the chopped domain |
| **num_segments** | Number of contiguous segments in the domain |
| **num_helix_strand_turn** | Total number of secondary structure elements (helices + strands + turns) |
| **num_helix** | Number of α-helices |
| **num_strand** | Number of β-strands |
| **num_helix_strand** | Count of helices + strands |
| **num_turn** | Number of turns |
| **packing_density** | Measure of structural compactness |
| **normed_radius_gyration** | Domain radius of gyration normalized by length |
| **avg_plddt** | Average pLDDT score for the domain |
| **proteome_id** | Identifier for the source viral proteome |
| **tax_common_name** | Common name of the source species/virus |
| **tax_scientific_name** | Scientific name |
| **tax_lineage** | Full taxonomic lineage string |
| **domqual** | Composite domain quality metric (0–1) |
| **dom_single_domain** | Whether the “Dom” boundary method classifies the protein as single-domain (`True`/`False`) |
| **foldseek_match_id** | Best Foldseek match identifier |
| **foldseek_evalue** | E-value for the Foldseek match |
| **foldseek_tmscore** | TM-score for structural similarity |
| **cath_label** | Assigned CATH classification (if available) |
| **foldseek_match_type** | `H` = homologous, `T` = topological, `N` = no confident match |
| **foldseek_query_cov** | Fraction of the query domain covered by the Foldseek match |
| **foldseek_target_cov** | Fraction of the target covered by the Foldseek match |
| **Q_score** | Domain quality metric derived from structural and consensus features |

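The sketch below shows one way to cross-check a row's `chopping` string against `nres_domain` and `num_segments`, and to split the domain index off `uniprot_id`. The chopping syntax is an assumption: segments are taken to be `start-end` ranges joined by `_` (for example `12-85_120-160`), which this README does not specify. The check also assumes `nres_domain` equals the summed segment lengths, which may not hold if residues are missing from the model. All function names are hypothetical.

```python
def parse_chopping(chopping: str) -> list[tuple[int, int]]:
    """Split a chopping string into inclusive (start, end) residue ranges.

    ASSUMPTION: segments look like 'start-end' and are joined by '_';
    adjust the delimiters if the release uses a different convention.
    """
    segments = []
    for part in chopping.split("_"):
        start, end = part.split("-")
        segments.append((int(start), int(end)))
    return segments


def chopping_matches_row(row: dict) -> bool:
    """Check that the chopping agrees with nres_domain and num_segments."""
    segments = parse_chopping(row["chopping"])
    nres = sum(end - start + 1 for start, end in segments)
    return nres == int(row["nres_domain"]) and len(segments) == int(row["num_segments"])


def split_domain_id(uniprot_id: str) -> tuple[str, int]:
    """Split an ID such as 'B5BTU2_01' into the accession and domain index."""
    accession, _, index = uniprot_id.rpartition("_")
    return accession, int(index)
```
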
---

## Summary of Changes in v0.2

- Updated domain definitions produced by the newest version of the pipeline  
- Additional taxonomic fields in the output file  
- Expanded quality metrics (`domqual`, `dom_single_domain`)  
- Harmonised structural annotation fields across domain fragments  
- Revised Foldseek and CATH mapping outputs  
- Improved overall consistency and completeness of the TSV schema  

---