Protein structure space is enormous [1,2]. Most linear polymers of amino acids have an intrinsic tendency to form dynamic and flexible 3D structures, which poses one of the open conundrums of the present day [3,5]. Mapping the variation in structure space will provide insight into the organisation and evolvability of protein fold space [1,4,5].
Local arrangements, or secondary structure elements (SSEs), form regular, well-defined patterns in the protein chain [9,3]. The arrangement of these patterns has been addressed at the architecture level in CATH [7] and at the fold level in SCOP [8]. However, these definitions are influenced by evolutionary and sequence information, which is biased by the current limits of our knowledge [10].
Here, with ProLego, we provide a simple and intuitive way to study protein structure space, using the core definition of "topology" as applied to protein structures [1,3]. Focusing on secondary structures (alpha helices and beta strands), we have catalogued the variation in protein structure topology across the current structure space.
Definition of Topology [1,3]
Here, topology is defined as the arrangement of secondary structures, their spatial contacts and their relative orientation. This definition helps to address the crucial aspects of local and non-local contacts and their relative positions in the context of the 3D structure.
As defined in the above section, the topology of each protein chain has been extracted [ref. alpha paper]. Each topology depends on the composition of secondary structures (the sequential arrangement of SSEs from the N- to the C-terminus) and their total count. The Topology browser provides a tabular representation of the different SSE combinations and their distinct topologies.
Entries can be searched by structure class (topClass: Alpha (A), Beta (B) or Alpha-Beta (AB)), number of helices, number of strands, or composition of secondary structure elements (SSEs).
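The search keys above can all be derived from the SSE composition itself. The following is a minimal sketch, not the ProLego implementation: it assumes a composition is written as an N-to-C string of "H" (helix) and "E" (strand) letters, and the function names and the dictionary keys other than topClass are hypothetical.

```python
def summarize_composition(sse_string):
    """Derive the search keys from an N-to-C string of SSE letters.

    Assumption: 'H' marks a helix and 'E' a strand; ProLego's own
    storage format may differ.
    """
    if not sse_string or set(sse_string) - {"H", "E"}:
        raise ValueError("expected a non-empty string of 'H' and 'E'")
    helices = sse_string.count("H")
    strands = sse_string.count("E")
    if strands == 0:
        top_class = "A"    # only helices
    elif helices == 0:
        top_class = "B"    # only strands
    else:
        top_class = "AB"   # both helices and strands
    return {
        "topClass": top_class,
        "helices": helices,        # hypothetical key name
        "strands": strands,        # hypothetical key name
        "composition": sse_string, # hypothetical key name
    }


def search(entries, **criteria):
    """Keep the entries whose fields equal every given criterion."""
    return [e for e in entries
            if all(e[key] == value for key, value in criteria.items())]


entries = [summarize_composition(s) for s in ("HHH", "EEEE", "HEEH", "EHE")]
hits = search(entries, topClass="AB", helices=2)
print([e["composition"] for e in hits])
```

Here the query for Alpha-Beta entries with two helices keeps only the "HEEH" composition; any combination of the four keys can be passed in the same way.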
[Total topologies : 1292]
A pre-calculated set of non-redundant curated datasets from PDB [different release dates and different sequence homology cutoffs] can be accessed in the protein browser. The protein browser shows a tabulated view (paginated at 30 entries) of the complete protein data in ProLego DB.
Entries can be searched by PDB ID or chain ID. The resulting table shows results similar to the queried ID. The table lists the molecule, which links to the RCSB PDB database for cross-reference, and the biological source of the protein.
[Total protein (non-redundant data @ 60% sequence homology cluster) : 31034]
Domain sets from CATH (v. 4.1: CATH_4.1) and SCOP (v. 1.75: ASTRAL_30) have been pre-calculated for exploring topologies. The browser provides an interface for searching by domain ID, structure class (i.e. the CATH or SCOP structure classification string), the molecule in the domain and its biological source.
[Total Domains (CATH + SCOP) @ 30% sequence homology cluster : 15255]
Given a four-character Protein Data Bank ID (PDB ID), the search engine retrieves the PDB molecule. The same page gives a brief description of the molecule in the PDB file and prompts for selection of a chain. Once the required conditions are satisfied, topology generation starts. When the topology has been built, the result page with all information appears.
The user can upload a PDB-formatted protein file with a keyword description. After the requested file has been submitted and uploaded successfully, topology building starts and the user is redirected to the result page.
From the Topology browser, once a topology has been selected (by clicking "see topology"), the user is redirected to the page for the queried topology. This page gives details such as the statistical significance and prevalence status (True/False) of the queried topology, along with the available significance score (P-values [see below]). The panel on the right shows a generic graph representation of the topologies in this group.
Following the top menu, the bottom panel lists the proteins and domains in that topology group. Relevant information on the protein chains is provided, with links to the individual protein pages.
ProLego server protein (or domain) results:
The right panel shows a brief introduction to the topology and its prevalence status as defined by the ProLego DB.
The ProLego server applies two levels of statistical tests. As the grouping of proteins is primarily based on the composition of secondary structures, it is crucial to verify whether the representation of each group is statistically significant. The second level tests the prevalence of a group of topologies.
As shown in the figure on the left, contacts between pairs of secondary structures are projected onto an adjacency matrix. The contact type encodes the SSE pair (H:H as H, E:E as E and H:E as C) together with the contact orientation (a, p or r). The 1D contact string is a segmental representation of the adjacency matrix, with segments separated by "-": the first segment covers the sequentially adjacent SSEs and the last segment covers the sequentially most distant pair (the N- and C-terminal SSEs).
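The segmental reading of the adjacency matrix can be illustrated with a short sketch. This is not the ProLego code: the function names are hypothetical, and two details are assumptions because they are not stated here, namely that a pair with no contact is written as "0" and that the codes within a segment are simply concatenated. Each segment corresponds to one diagonal of the upper triangle, moving away from the main diagonal.

```python
# Pair codes from the text: H:H -> H, E:E -> E, H:E -> C.
PAIR_CODE = {("H", "H"): "H", ("E", "E"): "E",
             ("H", "E"): "C", ("E", "H"): "C"}
ORIENTATIONS = {"a", "p", "r"}
NO_CONTACT = "0"  # assumption: placeholder for a pair without contact


def contact_code(sse_i, sse_j, orientation):
    """Encode one contact as pair code plus orientation, e.g. 'Ha'."""
    if orientation not in ORIENTATIONS:
        raise ValueError("unknown orientation: %r" % (orientation,))
    return PAIR_CODE[(sse_i, sse_j)] + orientation


def contact_string(sses, contacts):
    """Build the 1D contact string for a chain.

    sses     -- SSE types in N-to-C order, e.g. ['H', 'H', 'E', 'E']
    contacts -- dict mapping (i, j) with i < j to an orientation letter
    """
    n = len(sses)
    segments = []
    # offset 1 = sequentially adjacent SSEs, offset n-1 = first vs last SSE
    for offset in range(1, n):
        codes = []
        for i in range(n - offset):
            j = i + offset
            orientation = contacts.get((i, j))
            if orientation is None:
                codes.append(NO_CONTACT)
            else:
                codes.append(contact_code(sses[i], sses[j], orientation))
        segments.append("".join(codes))
    return "-".join(segments)


sses = ["H", "H", "E", "E"]
contacts = {(0, 1): "a", (1, 2): "p", (2, 3): "a", (0, 3): "r"}
print(contact_string(sses, contacts))  # HaCpEa-00-Cr
```

In this made-up four-SSE example, the first segment holds the three adjacent pairs, the second holds the two pairs that are one SSE apart (neither in contact), and the last holds the single contact between the N- and C-terminal SSEs.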
Please refer to : Khan, T., & Ghosh, I. (2015). Modularity in protein structures: study on all-alpha proteins. Journal of Biomolecular Structure and Dynamics, 1102(May), 1–15.
The Protein Data Bank has been filtered for X-ray structures of good resolution and clustered at sequence identities of 80%, 60% and 30%. The non-redundant subsets were generated with CD-HIT and the PISCES server. The main goal of varying the data is to check the consistency of the resulting topology groups and the robustness of the prevalence classes.
The application is developed in Python (v2.7.9-12). The web server is implemented in Django (> v. 1.9). Database support is provided by MySQL-lite.
The statistically significant topologies in ProLego DB can be used by running a local version of ProLego. Implementation details can be found in the GitHub repository (as mentioned below). The local version can also be used if you have an ensemble of structures. The output is in JSON format and contains topology information as well as module information.
The code can be downloaded from the GitHub repository.
The figure shows the schema of the ProLego database. The four tables are connected with appropriate keys.
Taushif Khan* @
PhD Student, MANF-SRF
email: taushi14_sit AT jnu.ac.in, taushifkhan AT gmail.com
Shailesh K. Panday* @
PhD Student, BINC-JRF
email: shaile27_sit AT jnu.ac.in
Prof. Indira Ghosh*
email: indira0654 AT gmail.com
The developers would like to acknowledge help from the following:
Acknowledgement for funding