Advanced data clustering.

Scalable clustering methods for chemical libraries, molecular dynamics trajectories and high-dimensional biological data. This is the backbone of our virtual screening post-processing — turning millions of poses into a tractable, diverse and prioritised set of candidates.

What we work on

Concrete clustering problems we solve at production scale.

Chemical library clustering by 2D fingerprints, 3D shape and pharmacophore similarity for diversity selection.
Docking pose clustering to reduce redundancy in million-pose screens and surface representative binders.
Consensus aggregation across docking, shape and pharmacophore scoring methods.
MD trajectory clustering by RMSD, contact maps and energy landscapes to extract representative conformations.
High-dimensional biological data clustering (omics, image features and time series) for downstream interpretation.
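As an illustration of the first item in the list above, fingerprint-based library clustering can be sketched with Tanimoto similarity and a Taylor-Butina-style procedure. This is a minimal sketch under stated assumptions, not the group's production pipeline: fingerprints are represented as Python sets of on-bit indices, the names `tanimoto` and `butina_cluster` and the default threshold are hypothetical, and neighbour counts are recomputed after every cluster is removed, which is a common variation on the original algorithm. The all-pairs loop is quadratic, so a library of realistic size would need a cheminformatics toolkit and a more scalable neighbour search.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention: two empty fingerprints are treated as identical
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def butina_cluster(fingerprints, threshold=0.6):
    """Group fingerprints into clusters of indices, centroid first in each cluster.

    Two molecules are neighbours when their Tanimoto similarity is at least
    `threshold` (an illustrative default, not a recommended value).
    """
    n = len(fingerprints)
    neighbours = [set() for _ in range(n)]
    # All-pairs neighbour lists: O(n^2), acceptable only for small examples.
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto(fingerprints[i], fingerprints[j]) >= threshold:
                neighbours[i].add(j)
                neighbours[j].add(i)

    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # Centroid = unassigned molecule with the most unassigned neighbours;
        # iterating in sorted order makes ties resolve to the lowest index.
        centre = max(sorted(unassigned), key=lambda i: len(neighbours[i] & unassigned))
        members = sorted(neighbours[centre] & unassigned)
        clusters.append([centre] + members)
        unassigned -= {centre, *members}
    return clusters


if __name__ == "__main__":
    # Toy fingerprints: three close analogues, a pair, and a singleton.
    library = [
        {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5},
        {10, 11, 12}, {10, 11, 13},
        {20, 21},
    ]
    for cluster in butina_cluster(library, threshold=0.5):
        print("centroid", cluster[0], "members", cluster[1:])
```

Taking the first index of each cluster gives one representative per chemotype, which is the usual starting point for a diversity-oriented shortlist.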

Tools we use

MetaScreener consensus module: aggregates results from multiple VS methods into ranked, deduplicated hit lists.
ASGARD: clustering and analysis of GROMACS MD trajectories.
Internal clustering scripts: Python and R pipelines for chemical and biological data, run on HPC.


Applications & target areas

Where scalable clustering changes what is actually possible in a project.

Virtual screening triage

Reducing 10⁶–10⁷ docked poses to a diverse, manageable shortlist for experimental testing.
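One simple way to picture this kind of triage is greedy leader clustering: poses are visited from best score to worst, and each pose either joins the first existing representative within an RMSD cutoff or founds a new cluster. The sketch below is illustrative only and rests on several assumptions: poses are lists of (x, y, z) tuples with identical atom ordering, all poses share the receptor frame so no superposition is applied, lower scores are better, ligand symmetry is ignored, and the names `rmsd` and `cluster_poses` and the 2.0 Å default cutoff are hypothetical choices.

```python
import math


def rmsd(pose_a, pose_b):
    """In-place RMSD between two poses with the same atom ordering (no superposition)."""
    if len(pose_a) != len(pose_b):
        raise ValueError("poses must contain the same number of atoms")
    squared = sum(
        (p - q) ** 2
        for atom_a, atom_b in zip(pose_a, pose_b)
        for p, q in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(pose_a))


def cluster_poses(poses, scores, cutoff=2.0):
    """Greedy leader clustering of docked poses.

    Returns a dict mapping each representative's index to the indices of its
    cluster members (the representative itself comes first). Because poses are
    visited best-score-first, every representative is the best-scored pose of
    its cluster.
    """
    order = sorted(range(len(poses)), key=lambda i: scores[i])  # lower = better
    clusters = {}
    for i in order:
        for rep in clusters:
            if rmsd(poses[i], poses[rep]) <= cutoff:
                clusters[rep].append(i)
                break
        else:
            clusters[i] = [i]  # no representative close enough: new cluster
    return clusters


if __name__ == "__main__":
    # Four toy two-atom poses forming two well-separated binding modes.
    poses = [
        [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
        [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0)],
        [(10.0, 0.0, 0.0), (11.0, 0.0, 0.0)],
        [(10.0, 1.0, 0.0), (11.0, 1.0, 0.0)],
    ]
    scores = [-9.0, -8.5, -7.0, -9.5]
    for rep, members in cluster_poses(poses, scores).items():
        print("representative", rep, "score", scores[rep], "members", members)
```

Each pose is compared only against the current representatives, so the cost grows with the number of clusters instead of the number of pose pairs, which is what makes this family of methods attractive for very large screens.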

MD post-processing

Extracting representative conformations from long simulations for further docking and free-energy work.

Diversity selection

Choosing chemically diverse subsets for screening campaigns and library design.
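A common heuristic for this task is MaxMin picking: starting from a seed compound, repeatedly add the candidate whose distance to its nearest already-picked compound is largest. The following is a minimal sketch under assumptions, not a description of the group's own scripts: fingerprints are sets of on-bit indices, distance is one minus Tanimoto similarity, and the names `tanimoto_distance` and `maxmin_pick` and the `first` seed parameter are hypothetical.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention: two empty fingerprints are treated as identical
    shared = len(a & b)
    return 1.0 - shared / (len(a) + len(b) - shared)


def maxmin_pick(fingerprints, n_pick, first=0):
    """Return indices of `n_pick` compounds chosen by the MaxMin heuristic.

    `first` is the index of the seed compound. Ties are resolved in favour of
    the lowest index.
    """
    n = len(fingerprints)
    if not 0 < n_pick <= n:
        raise ValueError("n_pick must be between 1 and the library size")
    picked = [first]
    # min_dist[i] = distance from compound i to its nearest picked compound.
    min_dist = [tanimoto_distance(fingerprints[first], fp) for fp in fingerprints]
    while len(picked) < n_pick:
        candidates = (i for i in range(n) if i not in picked)
        nxt = max(candidates, key=lambda i: min_dist[i])
        picked.append(nxt)
        # Only distances to the newly picked compound can lower min_dist.
        for i in range(n):
            d = tanimoto_distance(fingerprints[nxt], fingerprints[i])
            if d < min_dist[i]:
                min_dist[i] = d
    return picked


if __name__ == "__main__":
    library = [
        {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5},
        {10, 11, 12}, {10, 11, 13},
        {20, 21},
    ]
    print(maxmin_pick(library, n_pick=3))
```

Keeping a running nearest-picked distance per compound means each new pick costs one pass over the library, so the total work is proportional to library size times subset size.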

Biological data analysis

Patient stratification, image segmentation and time-series clustering for partner projects.

Selected papers

Reference publications and software underpinning this research line.

Topic	Reference
ASGARD — MD trajectory analysis and clustering	10.1080/07391102.2024.2349527
MetaScreener — consensus and post-processing of VS campaigns	github.com/bio-hpc/metascreener

Interested in this research line?

Have millions of poses, trajectories or descriptors to make sense of?

Prof. Horacio Pérez-Sánchez · hperez@ucam.edu