Advanced data clustering.

Scalable clustering methods for chemical libraries, molecular dynamics trajectories and high-dimensional biological data. This is the backbone of our virtual screening post-processing — turning millions of poses into a tractable, diverse and prioritised set of candidates.

What we work on

Concrete clustering problems we solve at production scale.

  • Chemical library clustering by 2D fingerprints, 3D shape and pharmacophore similarity for diversity selection.
  • Docking pose clustering to reduce redundancy in million-pose screens and surface representative binders.
  • Consensus aggregation across docking, shape and pharmacophore scoring methods.
  • MD trajectory clustering by RMSD, contact maps and energy landscapes to extract representative conformations.
  • High-dimensional biological data — omics, image features and time-series — clustered for downstream interpretation.
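
As an illustration of the consensus aggregation item above, the following is a minimal sketch of mean-rank aggregation across scoring methods. It is not the MetaScreener implementation; the function name `consensus_rank`, the input layout (one dictionary of compound scores per method) and the example values are assumptions made for this example.

```python
from collections import defaultdict


def consensus_rank(score_tables, higher_is_better):
    """Rank compounds by their mean rank across several scoring methods.

    score_tables: {method: {compound: score}} (hypothetical layout)
    higher_is_better: {method: bool}, the score direction of each method
    Returns a list of (mean_rank, compound), best consensus first.
    """
    # Only compounds scored by every method can be compared fairly.
    shared = sorted(set.intersection(*(set(t) for t in score_tables.values())))
    rank_sums = defaultdict(float)
    for method, scores in score_tables.items():
        ordered = sorted(shared, key=lambda c: scores[c],
                         reverse=higher_is_better[method])
        for rank, compound in enumerate(ordered, start=1):
            rank_sums[compound] += rank
    n_methods = len(score_tables)
    return sorted((rank_sums[c] / n_methods, c) for c in shared)


# Illustrative values only, not real screening results.
example_scores = {
    "docking": {"A": -9.1, "B": -7.5, "C": -8.2},       # lower is better
    "shape": {"A": 0.62, "B": 0.80, "C": 0.71},         # higher is better
    "pharmacophore": {"A": 0.90, "B": 0.40, "C": 0.70}, # higher is better
}
example_direction = {"docking": False, "shape": True, "pharmacophore": True}
ranking = consensus_rank(example_scores, example_direction)
```

Working on ranks rather than raw scores sidesteps the different units and scales of the individual methods, at the cost of discarding score magnitudes.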

Tools we use

  • MetaScreener consensus module: aggregates results from multiple VS methods into ranked, deduplicated hit lists.
  • ASGARD: clustering and analysis of GROMACS MD trajectories.
  • Internal clustering scripts: Python and R pipelines for chemical and biological data, run on HPC.

Applications & target areas

Where scalable clustering changes what is actually possible in a project.

Virtual screening triage

Reducing 10⁶–10⁷ docked poses to a diverse, manageable shortlist for experimental testing.
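
One simple way to perform this kind of reduction is greedy, score-ordered filtering: keep the best-scoring pose, then keep each further pose only if it differs from every pose already kept by more than an RMSD cutoff. The sketch below assumes that all poses share the receptor frame (so no superposition is needed), that atoms are listed in the same order in every pose, that lower scores are better, and that a 2.0 Å cutoff is appropriate. The names `rmsd` and `triage_poses` are hypothetical, and symmetry-equivalent atoms are not handled.

```python
import math


def rmsd(pose_a, pose_b):
    """RMSD between two poses given as equal-length lists of (x, y, z) tuples.

    No alignment is performed: docked poses are assumed to share one frame.
    """
    squared = sum((p - q) ** 2
                  for atom_a, atom_b in zip(pose_a, pose_b)
                  for p, q in zip(atom_a, atom_b))
    return math.sqrt(squared / len(pose_a))


def triage_poses(poses, scores, cutoff=2.0):
    """Return indices of representative poses, best score first."""
    # Assumption: lower score means a better pose (typical for docking energies).
    order = sorted(range(len(poses)), key=lambda i: scores[i])
    kept = []
    for i in order:
        # Keep a pose only if it is far from every representative so far.
        if all(rmsd(poses[i], poses[j]) > cutoff for j in kept):
            kept.append(i)
    return kept
```

Each pose is compared only against the representatives kept so far, so the cost scales with the number of poses times the number of clusters rather than with the square of the number of poses.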

MD post-processing

Extracting representative conformations from long simulations for further docking and free-energy work.
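
A common choice of representative conformation is the cluster medoid: the frame with the smallest summed distance to the other frames in its cluster. The sketch below assumes that a pairwise RMSD matrix and per-frame cluster labels have already been computed by some other tool. The function name `cluster_medoids` and the input layout are hypothetical and do not reflect how ASGARD does this.

```python
def cluster_medoids(dist, labels):
    """Map each cluster label to the index of its medoid frame.

    dist: square matrix (list of lists) of pairwise frame distances, e.g. RMSD
    labels: cluster label for each frame, in frame order
    """
    members = {}
    for frame, label in enumerate(labels):
        members.setdefault(label, []).append(frame)
    # The medoid minimises the summed distance to all frames in its cluster.
    return {
        label: min(frames, key=lambda i: sum(dist[i][j] for j in frames))
        for label, frames in members.items()
    }
```

Unlike an averaged structure, a medoid is an actual simulation frame, so it is physically valid and can be passed directly to docking or free-energy calculations.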

Diversity selection

Choosing chemically diverse subsets for screening campaigns and library design.
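
A widely used heuristic for this task is MaxMin picking: starting from a seed compound, repeatedly add the compound whose nearest already-picked neighbour is furthest away. The sketch below assumes that fingerprints are represented as Python sets of on-bit indices and that distance is one minus the Tanimoto similarity. The names `tanimoto` and `maxmin_pick` and the choice of seed are illustrative assumptions.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0


def maxmin_pick(fingerprints, n_pick, seed_index=0):
    """Greedily pick up to n_pick mutually dissimilar compounds (MaxMin)."""
    if not fingerprints or n_pick <= 0:
        return []
    picked = [seed_index]
    picked_set = {seed_index}
    # Distance from each compound to its nearest picked compound so far.
    nearest = [1.0 - tanimoto(fp, fingerprints[seed_index]) for fp in fingerprints]
    while len(picked) < min(n_pick, len(fingerprints)):
        candidate = max((i for i in range(len(fingerprints)) if i not in picked_set),
                        key=lambda i: nearest[i])
        picked.append(candidate)
        picked_set.add(candidate)
        for i, fp in enumerate(fingerprints):
            nearest[i] = min(nearest[i], 1.0 - tanimoto(fp, fingerprints[candidate]))
    return picked
```

Caching the nearest-picked distance means each round costs one pass over the library, rather than recomputing every distance to every picked compound.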

Biological data analysis

Patient stratification, image segmentation and time-series clustering for partner projects.

Selected papers

Reference publications and repositories underpinning this research line.

Topic                                                         Reference
ASGARD: MD trajectory analysis and clustering                 doi:10.1080/07391102.2024.2349527
MetaScreener: consensus and post-processing of VS campaigns   github.com/bio-hpc/metascreener

Interested in this line?

Have millions of poses, trajectories or descriptors to make sense of?

Prof. Horacio Pérez-Sánchez · hperez@ucam.edu