Proteomics is the field that studies the proteome, the full set of proteins expressed by an organism. The most powerful method for proteome analysis relies on mass spectrometry (MS). Some amino acid sub-chains of the proteins studied, termed peptides, are ionized and fragmented in the mass spectrometer. Then, the instrument measures the mass-to-charge ratio (m/z) of the peptides and of their fragments. The masses and intensities of the fragments are returned as an experimental fragmentation spectrum, which is used to identify the peptide.
A typical identification workflow systematically compares each experimental spectrum with all the theoretical fragmentation spectra of a reference database (derived in silico from the reference genome of the organism analyzed). The discrepancy between an experimental spectrum and a theoretical one is quantified by a score, and the best score for each experimental spectrum is used to match the spectrum to a peptide and then infer the protein (or gene) from which it originated.
Consequently, the quality of the scoring function used to quantify peptide-to-spectrum matches is paramount. Most search engines rely on scoring functions defined on $\mathbb{R}_+^N \times \mathbb{R}_+^N$, where $N$ is the number of bins used to discretize the spectra. Such vectorizations are computationally efficient and conceptually easy to relate to the MS resolution; however, they are highly sensitive to the choice of bin width.
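The sensitivity to bin width can be illustrated with a minimal sketch. It assumes a cosine similarity on binned intensity vectors, which is only one possible binned score and is not claimed to be the one used by any particular search engine. The function names, the toy peak lists, and the `mz_max` parameter are hypothetical.

```python
import numpy as np

def bin_spectrum(mz, intensity, bin_width, mz_max=2000.0):
    """Discretize a peak list into a fixed-length intensity vector."""
    n_bins = int(np.ceil(mz_max / bin_width))
    vec = np.zeros(n_bins)
    idx = np.floor(np.asarray(mz, dtype=float) / bin_width).astype(int)
    # np.add.at accumulates intensities when several peaks share a bin
    np.add.at(vec, idx, np.asarray(intensity, dtype=float))
    return vec

def cosine_score(u, v):
    """Cosine similarity between two binned spectra (0 if one is empty)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

# Toy data (assumption): the same three peaks, shifted by 0.02 m/z
mz_a = np.array([175.119, 300.99, 500.49])
mz_b = mz_a + 0.02
intensity = np.array([1.0, 0.5, 0.8])

scores = {}
for width in (1.0, 0.5, 0.3):
    u = bin_spectrum(mz_a, intensity, width)
    v = bin_spectrum(mz_b, intensity, width)
    scores[width] = cosine_score(u, v)
    print(f"bin width {width}: cosine = {scores[width]:.3f}")
```

In this toy case the score does not vary monotonically with the bin width: a peak pair contributes to the score only when both peaks fall on the same side of a bin edge, whatever their actual distance.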
We therefore propose a new scoring function for spectral similarity based on the Maximum Mean Discrepancy (MMD). The MMD is a kernel-based distance between probability distributions that has recently emerged as a powerful tool for machine learning and statistical inference.
When applied to MS data, the MMD can be interpreted as follows: each spectrum is modelled as a discrete probability distribution (a weighted sum of Dirac delta measures). The MMD is then the distance between the mean embeddings of the two distributions in a Reproducing Kernel Hilbert Space (RKHS).
In this poster, we provide a preliminary evaluation of the MMD for spectral similarity assessment in proteomics. We address the choice of kernel, link the tuning of its hyperparameter to the instrument's mass tolerance, and show how the approach alleviates binning-induced threshold effects. Finally, we present a workflow for evaluating the performance of this metric.
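As a rough illustration of this construction, the sketch below computes the MMD between two peak lists without any binning. It rests on several assumptions that the text does not specify: a Gaussian kernel on m/z values, peak weights taken as intensities normalized to sum to one, and an arbitrary bandwidth `sigma`. All function and variable names are hypothetical.

```python
import numpy as np

def gaussian_gram(x, y, sigma):
    """Gram matrix k(x_i, y_j) = exp(-(x_i - y_j)^2 / (2 sigma^2))."""
    diff = x[:, None] - y[None, :]
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2))

def mmd(mz_p, w_p, mz_q, w_q, sigma):
    """MMD between two spectra seen as weighted sums of Dirac measures.

    With P = sum_i a_i delta_{x_i} and Q = sum_j b_j delta_{y_j}:
    MMD^2(P, Q) = a^T K_xx a + b^T K_yy b - 2 a^T K_xy b
    """
    x = np.asarray(mz_p, dtype=float)
    y = np.asarray(mz_q, dtype=float)
    a = np.asarray(w_p, dtype=float)
    b = np.asarray(w_q, dtype=float)
    a = a / a.sum()  # normalize intensities to probability weights
    b = b / b.sum()
    sq = (a @ gaussian_gram(x, x, sigma) @ a
          + b @ gaussian_gram(y, y, sigma) @ b
          - 2.0 * (a @ gaussian_gram(x, y, sigma) @ b))
    # Clip tiny negative values caused by floating-point rounding
    return float(np.sqrt(max(sq, 0.0)))

# Toy data (assumption): three peaks and two shifted copies
mz = np.array([175.119, 300.99, 500.49])
w = np.array([1.0, 0.5, 0.8])
sigma = 0.02  # hypothetical bandwidth, in m/z units

d_small = mmd(mz, w, mz + 0.01, w, sigma)
d_large = mmd(mz, w, mz + 0.5, w, sigma)
print(f"shift 0.01: MMD = {d_small:.3f}")
print(f"shift 0.50: MMD = {d_large:.3f}")
```

With a Gaussian kernel the distance grows smoothly with the peak shift instead of jumping when a peak crosses a bin edge, and the bandwidth plays a role comparable to a tolerance on m/z.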
