CorrEA - Correlations between calculated Energies and Activities in ligand-receptor systems

Description of this software

This software identifies the maximum correlation between experimental activity and computed energies in cross-docking results. In doing so, it finds the binding pose of each congeneric ligand and the protein conformations involved in molecular recognition. The search is performed by a genetic algorithm that runs for M iterations. The algorithm works as follows:

The algorithm optimizes combinations of molecular docking poses. A combination assigns one binding pose, and therefore one score value, to each congeneric ligand, so that the reported experimental activities can be compared with the computationally predicted energies. The process begins by generating N combinations, each of which is evaluated using the coefficient of determination (R²) between experimental activity and predicted energy. The combinations with the highest R² values are selected and carried into the next iteration cycle. Within the same cycle, new combinations are generated by crossover between existing combinations. These new combinations may undergo point mutations, which mostly change the index of a ligand binding pose and therefore the energy value to be correlated. Optionally, the number of protein conformations in a combination can be reduced to simulate the conformational selection theory. New combinations are generated in this way until the total again reaches N. The cycle is repeated until a predefined convergence criterion is met, and the combination with the highest R² value is reported as the optimal one.
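The cycle described above can be illustrated with a minimal sketch. It is not the CorrEA implementation: the function names, the single-point crossover, the fixed iteration count used as the stopping rule, and the data layout (energies[l][p] is the score of pose p for ligand l) are all assumptions made for illustration. The random offspring and protein trimming steps are left out to keep the example short.

```python
import random


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no correlation
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)


def fitness(combo, energies, activity):
    """R² between the energies of the chosen poses and the activities."""
    picked = [energies[lig][pose] for lig, pose in enumerate(combo)]
    return r_squared(picked, activity)


def evolve(energies, activity, n_comb=200, n_iter=2000,
           elite=0.30, crossover=0.90, mutation=0.10, seed=0):
    """Hypothetical GA loop: keep the best combinations, refill by crossover."""
    rng = random.Random(seed)
    n_lig = len(energies)  # assumed to be at least 2

    def random_combo():
        return [rng.randrange(len(energies[l])) for l in range(n_lig)]

    population = [random_combo() for _ in range(n_comb)]
    for _ in range(n_iter):
        # Rank by R² and keep the top fraction unchanged (elitism).
        population.sort(key=lambda c: fitness(c, energies, activity),
                        reverse=True)
        survivors = population[:max(2, int(elite * n_comb))]
        children = []
        while len(survivors) + len(children) < n_comb:
            a, b = rng.sample(survivors, 2)
            if rng.random() < crossover:
                cut = rng.randrange(1, n_lig)  # single-point crossover
                child = a[:cut] + b[cut:]
            else:
                child = list(a)
            if rng.random() < mutation:
                # Point mutation: pick a different pose for one ligand.
                lig = rng.randrange(n_lig)
                child[lig] = rng.randrange(len(energies[lig]))
            children.append(child)
        population = survivors + children
    return max(population, key=lambda c: fitness(c, energies, activity))
```

Because the best combinations are copied unchanged into the next cycle, the best R² in this sketch never decreases from one iteration to the next.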

The parameters of the genetic algorithm are as follows:

Number of iterations (-gn): Number of cycles or iterations to search for the combination with the maximum R² value between experimental activity and computationally predicted energy. Default value: 2000.
Number of possible combinations (-gp): Total number of combinations to be evaluated with R² in each iteration. Default value: 200.
Mutation rate (-gm): Percentage of mutations that change the ligand binding pose, resulting in new computationally predicted energy values to be correlated. Default value: 10%.
Crossover percentage (-gc): Percentage of crossover between existing combinations used to generate new combinations. Default value: 90%.
Offspring elitism rate (-goe): Fraction of the current combinations (those with the highest R² scores) that are selected as offspring for the next iteration cycle. Default value: 30%.
Offspring random rate (-gor): Controls how much of the offspring is generated randomly. Default value: 10%.
Penalty coefficient per protein (-gr): Penalty coefficient that controls the extent to which the number of protein conformations affects the individual score. Fewer protein conformations are always preferred. Default value: 10%.
Protein trim rate (-gt): Rate at which genes are trimmed. Trimming a gene excludes one randomly selected protein conformation from it, reducing the number of protein conformations in the gene by one. The ligand poses that belonged to the excluded conformation are replaced with valid poses from the remaining protein conformations of the gene. Default value: 10%.
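For scripted runs it can help to keep the defaults listed above in one place. The following is a minimal sketch, not part of CorrEA: the class name GAParams is hypothetical, the field names simply mirror the flags, percentages are stored as plain numbers (10.0 means 10%), and the range checks are assumptions.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GAParams:
    """Hypothetical container for the documented default values."""
    gn: int = 2000     # -gn  number of iterations
    gp: int = 200      # -gp  combinations evaluated per iteration
    gm: float = 10.0   # -gm  mutation rate, %
    gc: float = 90.0   # -gc  crossover percentage, %
    goe: float = 30.0  # -goe offspring elitism rate, %
    gor: float = 10.0  # -gor offspring random rate, %
    gr: float = 10.0   # -gr  penalty coefficient per protein, %
    gt: float = 10.0   # -gt  protein trim rate, %

    def __post_init__(self):
        # Assumed sanity checks; the tool itself may accept other ranges.
        if self.gn < 1 or self.gp < 2:
            raise ValueError("gn must be >= 1 and gp must be >= 2")
        for f in fields(self):
            if f.name in ("gn", "gp"):
                continue
            value = getattr(self, f.name)
            if not 0.0 <= value <= 100.0:
                raise ValueError(f"{f.name} must be a percentage in [0, 100]")
```

For example, GAParams(gn=5000, gm=20.0) describes a longer run with a higher mutation rate while leaving the other settings at their documented defaults.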

The values assigned to each parameter of the genetic algorithm were determined through multiple exploratory runs and reflect a balance between convergence stability and computational efficiency. While the default values are suitable for most use cases, users may fine-tune these settings depending on the number of ligands, protein conformations, and desired precision.

In general, increasing the number of iterations (-gn) or the number of combinations (-gp) tends to improve the exploration of the search space and the likelihood of achieving higher correlations, at the cost of longer runtimes. A higher mutation rate (-gm) can help escape local optima, though excessive mutation may reduce convergence stability. The crossover percentage (-gc) controls the balance between preserving existing high-quality solutions and generating diversity. The elitism rate (-goe) and random rate (-gor) influence how much the algorithm exploits current solutions versus explores new ones.

The penalty coefficient (-gr) and protein trimming rate (-gt) are particularly relevant when applying conformational selection: increasing these values biases the algorithm toward solutions involving fewer protein conformations, potentially highlighting more selective molecular recognition patterns.

Prune and search method

The genetic algorithm operates as a random search engine within a vast space of possible combinations. As a result, repeated runs with the same parameters can find different combinations with maximum R². To address this, the prune and search method was implemented. This technique groups the poses of each ligand to reduce the search space, and thereby identifies the unique combination in which each ligand binding pose energy contributes to the maximum R² value. The prune and search method also allows the number of iterations or the number of combinations of the genetic algorithm to be reduced.

The parameters of the prune and search method are as follows:

Grouping ligand poses (-I): Enables the iterative procedure that reduces the combinatorial search space of docking poses using the prune and search method. This greatly improves the convergence of the search, but the result may be suboptimal.
Number of groups (-k): Number of groups used at each iteration of the iterative search. Note that a larger value requires a greater number of GA iterations to converge.
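The idea of grouping poses and searching iteratively can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used by CorrEA: here the poses of each ligand are sorted by energy and cut into at most k contiguous groups, each group is represented by its mean energy, and an exhaustive search over group choices stands in for the genetic algorithm. All function names are hypothetical.

```python
from itertools import product


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)


def split_into_groups(items, k):
    """Sort items and cut them into at most k contiguous, non-empty groups."""
    ordered = sorted(items)
    k = min(k, len(ordered))
    size, extra = divmod(len(ordered), k)
    groups, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < extra else 0)
        groups.append(ordered[start:end])
        start = end
    return groups


def prune_and_search(energies, activity, k=2):
    """Return one pose index per ligand after repeated group pruning."""
    if k < 2:
        raise ValueError("k must be at least 2 so every round shrinks the sets")
    # Candidate poses per ligand, stored as (energy, pose_index) pairs.
    candidates = [[(e, i) for i, e in enumerate(lig)] for lig in energies]
    while any(len(c) > 1 for c in candidates):
        grouped = [split_into_groups(c, k) for c in candidates]
        means = [[sum(e for e, _ in g) / len(g) for g in groups]
                 for groups in grouped]

        def score(pick):
            chosen = [means[l][g] for l, g in enumerate(pick)]
            return r_squared(chosen, activity)

        # Exhaustive stand-in for the GA: try every choice of one group
        # per ligand and keep the choice with the highest R².
        best = max(product(*(range(len(m)) for m in means)), key=score)
        # Prune: only the poses inside the winning groups survive.
        candidates = [grouped[l][g] for l, g in enumerate(best)]
    return [c[0][1] for c in candidates]
```

Each round discards every pose outside the winning groups, which is why the search converges quickly but can miss the globally best combination when a good pose sits in a group whose mean energy correlates poorly.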