CATH non-redundant S20/40_overlap_60 dataset
============================================

Overview
--------

This is a non-redundant subset of CATH domains that:

 * contains no pair of domains that (according to BLAST) shares >= 20/40% sequence identity over >= 60% overlap (over the longer sequence) and
 * is otherwise as large as we could make it.

For more detail, please see below.

Files
-----

 * Cath.DataSet.NonRedundant.S40_overlap_60.v4_2_0.atom.fa    - The ATOM sequences of the domains in the dataset (which only contain residues that have ATOM records in the PDB file)
 * Cath.DataSet.NonRedundant.S40_overlap_60.v4_2_0.fa         - The sequences of the domains in the dataset
 * Cath.DataSet.NonRedundant.S40_overlap_60.v4_2_0.list       - A list of the domains in the dataset; one domain ID per line
 * Cath.DataSet.NonRedundant.S40_overlap_60.v4_2_0.pdb.tgz    - A gzipped tar file containing the PDB files of the domains in the dataset
 * Cath.DataSet.NonRedundant.S40_overlap_60.v4_2_0.README.txt - This file

Method of Construction
----------------------

The sequence comparisons are performed with an all-against-all BLAST of our S100s. This is done by building a library of the S100 sequences with a command like:

    makeblastdb -in CATH.S100.COMBS_sequences.fa -dbtype prot -out new_library

...and then scanning each of the sequences against that library with commands like:

    blastp -dbsize 100000 -query domain_COMBS_sequence.fa -db new_library -outfmt '6 qseqid sseqid pident length slen qlen' -out domain_results_file -max_target_seqs 100000000

We then use these results to identify any links with:

 * different IDs, ie non-self-hits ( ie qseqid != sseqid )
 * >= 20/40% sequence identity ( eg pident >= 40 for the S40 dataset ) and
 * >= 60% overlap over the longer sequence ( ie 100.0 * length / max(slen, qlen) >= 60 )

We then use these links to form a list of domains that contains no pair of linked entries.
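
As an illustration of the three link criteria above, here is a minimal Python sketch that filters BLAST result lines in the tabular layout requested by the blastp command (qseqid, sseqid, pident, length, slen, qlen, tab-separated). The function names `is_link` and `parse_links` and the domain IDs in the usage note are hypothetical; this is not the code used to build the dataset, and the default identity threshold of 40 is assumed to correspond to the S40 set.

```python
def is_link(qseqid, sseqid, pident, length, slen, qlen,
            min_identity=40.0, min_overlap=60.0):
    """Return True if one BLAST hit satisfies all three link criteria."""
    # Criterion 1: ignore self-hits.
    if qseqid == sseqid:
        return False
    # Criterion 3: overlap is measured against the longer of the two sequences.
    overlap = 100.0 * length / max(slen, qlen)
    # Criterion 2 (identity) and criterion 3 (overlap) must both hold.
    return pident >= min_identity and overlap >= min_overlap


def parse_links(lines, min_identity=40.0, min_overlap=60.0):
    """Collect the unordered pairs of linked IDs from tab-separated result lines."""
    links = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        qseqid, sseqid, pident, length, slen, qlen = line.split("\t")
        if is_link(qseqid, sseqid, float(pident), int(length),
                   int(slen), int(qlen), min_identity, min_overlap):
            # A frozenset makes (a, b) and (b, a) the same link.
            links.add(frozenset((qseqid, sseqid)))
    return links
```

For example, `parse_links(["dom_a\tdom_b\t45.000\t70\t100\t90"])` treats the made-up pair dom_a/dom_b as linked, because the identity is 45% and the overlap is 70 / max(100, 90) = 70%.
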
In an effort to make the list as large as possible, we build it iteratively, only adding a domain if no other domain has fewer linked neighbours. This means the algorithm should nibble as many edges off a cluster as possible, rather than taking a small number of domains at the cluster's centre.

If you have any comments/suggestions/criticisms, please let us know:

    http://www.cathdb.info/support/contact
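
The iterative selection described under Method of Construction can be sketched in Python as below. This is a hedged illustration rather than the actual implementation. It assumes that, once a domain is chosen, the domain and all of its linked neighbours are removed from the pool of candidates, that neighbour counts are taken over the remaining candidates only, and that ties are broken by domain ID. None of these details is stated above. The function name `select_non_redundant` and the IDs in the usage note are hypothetical.

```python
def select_non_redundant(domains, links):
    """Greedily pick domains so that no two chosen domains are linked.

    domains: iterable of domain IDs
    links:   iterable of (id_a, id_b) pairs of linked domains
    """
    # Build an undirected neighbour map, ignoring self-links and unknown IDs.
    neighbours = {d: set() for d in domains}
    for a, b in links:
        if a != b and a in neighbours and b in neighbours:
            neighbours[a].add(b)
            neighbours[b].add(a)

    remaining = set(neighbours)
    chosen = []
    while remaining:
        # Pick the candidate with the fewest neighbours still in the pool
        # (assumption: ties are broken by ID to keep the result deterministic).
        best = min(remaining, key=lambda d: (len(neighbours[d] & remaining), d))
        chosen.append(best)
        # Remove the pick and everything linked to it from the pool.
        remaining -= neighbours[best] | {best}
    return chosen
```

With a star-shaped cluster such as `select_non_redundant(["centre", "leaf1", "leaf2", "leaf3"], [("centre", "leaf1"), ("centre", "leaf2"), ("centre", "leaf3")])`, the sketch keeps the three leaves and drops the centre, which is the behaviour the paragraph above aims for. The repeated `min` scan is quadratic in the number of domains, so a real run over many domains would want a priority queue instead.
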