Research Line
Advanced data clustering
Scalable clustering methods for chemical libraries, molecular dynamics trajectories and high-dimensional biological data. This is the backbone of our virtual screening post-processing — turning millions of poses into a tractable, diverse and prioritised set of candidates.
What we work on
Concrete clustering problems we solve at production scale.
- Chemical library clustering by 2D fingerprints, 3D shape and pharmacophore similarity for diversity selection.
- Docking pose clustering to reduce redundancy in million-pose screens and surface representative binders.
- Consensus aggregation across docking, shape and pharmacophore scoring methods.
- MD trajectory clustering by RMSD, contact maps and energy landscapes to extract representative conformations.
- High-dimensional biological data — omics, image features and time-series — clustered for downstream interpretation.
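As an illustration of the fingerprint-based clustering listed above, here is a minimal sketch of leader (sphere-exclusion style) clustering under Tanimoto similarity. It is not the group's production code. The representation of fingerprints as Python sets of on-bit indices, the function names `tanimoto` and `leader_cluster`, the threshold value and the toy data are all assumptions made for the example.

```python
from typing import List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def leader_cluster(fps: List[Set[int]], threshold: float = 0.6) -> List[List[int]]:
    """Assign each fingerprint to the first leader it matches, else open a new cluster."""
    leaders: List[int] = []          # index of the first member of each cluster
    clusters: List[List[int]] = []   # member indices per cluster
    for i, fp in enumerate(fps):
        for c, leader in enumerate(leaders):
            if tanimoto(fp, fps[leader]) >= threshold:
                clusters[c].append(i)
                break
        else:
            # No existing leader is similar enough: this item becomes a leader.
            leaders.append(i)
            clusters.append([i])
    return clusters


# Toy fingerprints (hypothetical on-bit indices, not real molecules).
toy_fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {10, 11, 13}, {20, 21}]
clusters = leader_cluster(toy_fps, threshold=0.5)

# One representative per cluster (here simply the leader) gives a diverse subset.
representatives = [members[0] for members in clusters]
```

A single pass like this needs only one comparison per item and existing leader, so it never builds the full pairwise matrix, which is one reason leader-style methods are a common choice for very large libraries. The result depends on input order, so sorting by score before clustering is a typical design choice.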
Tools we use
- MetaScreener consensus module — aggregates results from multiple VS methods into ranked, deduplicated hit lists.
- ASGARD — clustering and analysis of GROMACS MD trajectories.
- Internal clustering scripts — Python and R pipelines for chemical and biological data, run on HPC.
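To make the idea of consensus aggregation concrete, the sketch below orders compounds by their mean rank across several scoring methods. This is a generic rank-averaging scheme shown for illustration only; it is not MetaScreener's actual algorithm or API. The function name `consensus_rank`, the input layout and the toy scores are hypothetical, and the sketch assumes every score has already been oriented so that lower is better.

```python
from typing import Dict, List


def consensus_rank(scores_by_method: Dict[str, Dict[str, float]]) -> List[str]:
    """Order compound IDs by mean rank across methods (lower score = better)."""
    rank_sums: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for scores in scores_by_method.values():
        # Rank compounds within one method; rank 1 is the best score.
        ordered = sorted(scores, key=lambda cid: scores[cid])
        for rank, cid in enumerate(ordered, start=1):
            rank_sums[cid] = rank_sums.get(cid, 0.0) + rank
            counts[cid] = counts.get(cid, 0) + 1
    # A compound missing from a method is averaged over the methods that scored it.
    mean_rank = {cid: rank_sums[cid] / counts[cid] for cid in rank_sums}
    # Ties on mean rank are broken alphabetically by compound ID.
    return sorted(mean_rank, key=lambda cid: (mean_rank[cid], cid))


# Toy scores (hypothetical values), all oriented so that lower is better.
toy_scores = {
    "docking": {"A": -9.1, "B": -7.5, "C": -8.2},
    "shape": {"A": 0.2, "B": 0.5, "C": 0.1},
    "pharmacophore": {"A": 0.3, "B": 0.9, "C": 0.4},
}
hit_list = consensus_rank(toy_scores)
```

Keying every method's results by compound ID means each compound appears once in the final list, which is the simplest form of deduplication. Working on ranks rather than raw scores avoids having to put docking energies and similarity scores on a common scale.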
Applications & target areas
Where scalable clustering changes what is actually possible in a project.
Virtual screening triage: Reducing 10⁶–10⁷ docked poses to a diverse, manageable shortlist for experimental testing.
MD post-processing: Extracting representative conformations from long simulations for further docking and free-energy work.
Diversity selection: Choosing chemically diverse subsets for screening campaigns and library design.
Biological data analysis: Patient stratification, image segmentation and time-series clustering for partner projects.
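One common way to extract a representative conformation from each cluster of MD frames is to take the cluster medoid, the frame with the smallest total distance to the other members. The sketch below assumes a precomputed pairwise RMSD matrix and existing cluster labels. The function name `cluster_medoids` and the toy values are hypothetical, and this is not a description of how ASGARD selects representatives.

```python
from typing import Dict, List, Sequence


def cluster_medoids(dist: Sequence[Sequence[float]], labels: Sequence[int]) -> Dict[int, int]:
    """Map each cluster label to the member with the smallest summed distance to its cluster."""
    members: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        members.setdefault(label, []).append(idx)
    medoids: Dict[int, int] = {}
    for label, idxs in members.items():
        # min() keeps the first candidate on ties, so the earliest frame wins.
        medoids[label] = min(idxs, key=lambda i: sum(dist[i][j] for j in idxs))
    return medoids


# Toy symmetric pairwise RMSD matrix for five frames (hypothetical values).
toy_rmsd = [
    [0.0, 1.0, 2.0, 5.0, 5.5],
    [1.0, 0.0, 1.0, 4.5, 5.0],
    [2.0, 1.0, 0.0, 4.0, 4.5],
    [5.0, 4.5, 4.0, 0.0, 0.8],
    [5.5, 5.0, 4.5, 0.8, 0.0],
]
toy_labels = [0, 0, 0, 1, 1]
representative_frames = cluster_medoids(toy_rmsd, toy_labels)
```

Unlike an averaged structure, a medoid is an actual frame from the trajectory, so it is physically valid and can be passed directly to docking or free-energy calculations.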
Selected resources
- ASGARD — MD trajectory analysis and clustering — DOI 10.1080/07391102.2024.2349527
- MetaScreener — consensus and post-processing of VS campaigns — github.com/bio-hpc/metascreener
Interested in this line?
Contact Prof. Horacio Pérez-Sánchez · hperez@ucam.edu