PDBselect
Database Description
PDBselect began when about 700 protein chains with known 3D structure were available, less than 1% of the January 2007 count. The first release was a representative list of 155 protein chains with mutual sequence similarity of less than 30 percent; subsequent releases used a threshold of 25 percent. To generate the representative list of protein chains, an all-versus-all sequence comparison is performed. The distance between two protein sequences is calculated with the HSSP-function, later refined by Abagyan and Batalov on the basis of a larger data set. When the function scores two protein chains as related, the one with lower quality is removed, so that the final list consists of high-quality structures. Quality is defined as "resolution [in Angstrom] plus R-factor/20", where a smaller value indicates a better structure; NMR structures are assigned an arbitrary (low) quality.
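The selection procedure described above can be illustrated with a minimal Python sketch. It is not the PDBselect implementation: the names `Chain`, `quality_score`, `toy_related`, `select_representatives` and `NMR_SCORE` are hypothetical, the R-factor is assumed to be given in percent, and `toy_related` is a crude identity check that only stands in for the HSSP-function. The sketch also uses a simple greedy pass (best structure first) rather than an explicit pairwise removal step, which is one possible way to arrive at a list in which no two chains are related.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Assumption: an arbitrary, poor score for NMR entries, which have no
# resolution or R-factor. The actual value used by PDBselect is not stated.
NMR_SCORE = 100.0


@dataclass
class Chain:
    chain_id: str
    sequence: str
    resolution: Optional[float] = None  # in Angstrom; None for NMR
    r_factor: Optional[float] = None    # assumed to be in percent; None for NMR


def quality_score(chain: Chain) -> float:
    """Resolution plus R-factor/20; a smaller score means a better structure."""
    if chain.resolution is None or chain.r_factor is None:
        return NMR_SCORE
    return chain.resolution + chain.r_factor / 20.0


def toy_related(a: Chain, b: Chain, threshold: float = 0.25) -> bool:
    """Placeholder for the HSSP-function: fraction of identical residues at
    matching positions, without alignment. For illustration only."""
    n = min(len(a.sequence), len(b.sequence))
    if n == 0:
        return False
    matches = sum(x == y for x, y in zip(a.sequence, b.sequence))
    return matches / n >= threshold


def select_representatives(
    chains: List[Chain],
    related: Callable[[Chain, Chain], bool] = toy_related,
) -> List[Chain]:
    """Greedy selection: visit chains from best to worst score and keep a
    chain only if it is not related to any chain already kept."""
    kept: List[Chain] = []
    for chain in sorted(chains, key=quality_score):
        if not any(related(chain, other) for other in kept):
            kept.append(chain)
    return kept


if __name__ == "__main__":
    # Made-up chain identifiers and sequences, for demonstration only.
    example = [
        Chain("1abcA", "ACDEFGHIKL", resolution=2.0, r_factor=20.0),
        Chain("2xyzA", "ACDEFGHIKM", resolution=1.5, r_factor=18.0),
        Chain("3nmrA", "WWWWWWWWWW"),  # NMR entry
    ]
    print([c.chain_id for c in select_representatives(example)])
```

In this made-up example the two nearly identical chains compete, and the one with the smaller score (better resolution and R-factor) is kept; the unrelated NMR chain is kept as well, despite its poor score, because nothing else in the list is related to it. Replacing `toy_related` with a real alignment-based relatedness test would be needed for any practical use.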
References
2. Hobohm, U. and Sander, C. (1994) Enlarged representative set of protein structures. Protein Sci, 3, 522-524.