CATH non-redundant S20_overlap_60 / S40_overlap_60 dataset
==========================================================

Overview
--------

Two non-redundant subsets of the CATH domains. Each subset:

* contains no pair of domains that (according to BLAST) shares >= X%
  sequence identity over >= 60% overlap (measured over the longer
  sequence), and
* is otherwise as large as we could make it.

There is one dataset for 20% sequence identity (containing 9,894
domains in v4_1_0) and another for 40% sequence identity (containing
21,090 domains in v4_1_0).

For more detail, please see below.

Files
-----

* Cath.DataSet.NonRedundant.S20_overlap_60.v4_1_0.atom.fa - The ATOM
  sequences of the domains in the dataset (containing only those
  residues that have ATOM records in the PDB file)
* Cath.DataSet.NonRedundant.S20_overlap_60.v4_1_0.fa - The sequences
  of the domains in the dataset
* Cath.DataSet.NonRedundant.S20_overlap_60.v4_1_0.list - A list of
  the domains in the dataset; one domain ID per line
* Cath.DataSet.NonRedundant.S20_overlap_60.v4_1_0.pdb.tgz - A gzipped
  tar file containing the PDB files of the domains in the dataset

...and a corresponding set of files for each level of sequence
identity.

Method of Construction
----------------------

The sequence comparisons are performed with an all-against-all BLAST
of our S100 sequences. This is done by building a library of the S100
sequences with a command like:

    makeblastdb -in CATH.S100.COMBS_sequences.fa -dbtype prot -out new_library

...and then scanning each of the sequences against that library with
commands like:

    blastp -dbsize 100000 -query domain_COMBS_sequence.fa -db new_library -outfmt '6 qseqid sseqid pident length slen qlen' -out domain_results_file -max_target_seqs 100000000

We then use these results to identify any links with:

* different IDs, i.e. non-self-hits ( qseqid != sseqid ),
* >= X% sequence identity ( pident >= X, where X is 20 or 40 ), and
* >= 60% overlap over the longer sequence
  ( 100.0 * length / max(slen, qlen) >= 60 ).

We then use these links to build a list of domains that contains no
pair of linked entries. In an effort to make the list as large as
possible, we add domains to the list one at a time, only ever adding
a domain that has as few linked neighbours as any other remaining
domain. This means the algorithm should nibble as many edges off a
cluster as possible, rather than taking a small number of domains at
the cluster's centre. (An illustrative sketch of the filtering and
selection steps is given in the appendix at the end of this file.)

If you have any comments/suggestions/criticisms, please let us know:
http://www.cathdb.info/support/contact
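
Appendix: illustrative code sketch
----------------------------------

The following is a minimal Python sketch of the link-filtering step
described above. It is not the actual CATH pipeline; the function
name and file handling are illustrative assumptions. It parses the
tabular output produced by the blastp command above (outfmt 6 with
the columns qseqid sseqid pident length slen qlen) and yields the
linked pairs:

    def read_links(results_file, identity_cutoff, min_overlap=60.0):
        """Yield (qseqid, sseqid) pairs meeting the link criteria above."""
        with open(results_file) as fh:
            for line in fh:
                qseqid, sseqid, pident, length, slen, qlen = line.split()
                if qseqid == sseqid:
                    continue  # skip self-hits
                if float(pident) < identity_cutoff:
                    continue  # below the identity cutoff (20 or 40)
                # Overlap is measured over the longer of the two sequences
                overlap = 100.0 * int(length) / max(int(slen), int(qlen))
                if overlap < min_overlap:
                    continue  # insufficient overlap
                yield qseqid, sseqid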
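
And this is a sketch, under the same caveats, of the greedy selection
step: it repeatedly keeps a remaining domain that has as few linked
neighbours as any other, then drops everything linked to it, so that
no two kept domains are linked:

    def select_non_redundant(domains, links):
        """Greedily build a list containing no pair of linked domains."""
        neighbours = {d: set() for d in domains}
        for a, b in links:
            # Assumes every linked ID also appears in `domains`
            if a in neighbours and b in neighbours:
                neighbours[a].add(b)
                neighbours[b].add(a)
        remaining = set(domains)
        chosen = []
        while remaining:
            # Choose a domain of minimal degree among those remaining,
            # nibbling edges off a cluster rather than taking its centre
            best = min(remaining, key=lambda d: len(neighbours[d] & remaining))
            chosen.append(best)
            remaining.discard(best)        # the chosen domain is kept
            remaining -= neighbours[best]  # drop everything linked to it
        return chosen

For example, select_non_redundant(s100_ids, read_links('domain_results_file', 20.0))
(where s100_ids stands for a hypothetical list of all S100 domain IDs)
would give a candidate S20 list. Re-computing the minimum degree on
every iteration keeps the sketch short; an efficient implementation
would update degrees incrementally as domains are removed.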