Build a mass spectrometry analysis pipeline in Python using matchms — part III: molecular networking

After “part I” and “part II” which introduced importing and processing a mass spectra dataset and computing spectral similarities using Python, matchms, and Spec2Vec, we will now generate some nice networks!

Florian Huber
Netherlands eScience Center
7 min readAug 13, 2021

[last edit: 31/01/2023, using matchms version 0.18.0, spec2vec version 0.8.0, matchmsextras version 0.3.0]

Zoom into a few clusters of the network generated in this tutorial (node color reflects compound mass)

In part I of this tutorial, matchms, a Python library to import, process, and compare tandem mass spectrometry data was introduced. We imported a MS/MS dataset with spectra of natural products which we downloaded from GNPS. The data then was processed using matchms to gain more reliable metadata and remove possible noise peaks.

Then, in part II, we introduced Spec2Vec which is an unsupervised machine learning technique to compute spectral similarities.

Getting the required packages

If you have run part I and II before then you can simply add the package matchmsextras by pip install matchmsextras.

Import the processed data from part I

If you did the data importing and processing as shown in part I then you can simply continue from there, e.g. by importing the data that was stored after the processing:

Load a pretrained Spec2Vec model

For many use-cases we would not advise to retrain a new model from scratch. Instead a more general model that has been trained on a large MS/MS dataset can simply be loaded and used to calculate similarities, even for spectra which were not part of the initial model training.

Here we can download a model trained on about 95,000 spectra from GNPS (positive ionmode) which can be downloaded from Zenodo.
Now let’s import this model.

Compute similarities

As shown in part II, we can use the pretrained model to compute the similarities between all spectra in spetrums.

Running this will give a warning that one word is missing which simply means that one peak in the data does not occur in the model vocabulary. Since it seems to be a small peak, the impact of this is small (0.19%) and will be ignored.

Now that we have the similarities for all possible spectrum pairs, we can generate a network (or graph). This is an operation that is currently done a lot in the mass spec community, mainly triggered by GNPS and its community, and is often referred to as molecular networking. See also Wang et al. (2016).

What it mostly means, is that spectra are treated as nodes in a network and similar spectra (according to a chosen similarity score) will be connected by edges. Typically edges are only added if they are above a minimum threshold score. Often also the number of edges is restricted by setting a maximum of edges per node.

Compute network/graph from similarities

Graphs based on similarities can be generated (and pruned/modified) in many different ways. Here we will start rather simple. The function create_network() will create a node for each spectrum. Then it will add edges (connections) for each node in the following way: For each spectrum, the max_links highest scoring connections will be added, as long as they have a similarity score of > cutoff. That means, per node, between 0 (if no high enough similarity score exists) and max_links connections will be added.

With matchms versions lower than 0.9.1 you would have to run it via functions from matchmsextraslike this:

Actually, that’s it. Now a network was generated based on networkx which also offers plenty of options for further network analysis (see the networkx documentation).

Let’s have a look at some of the possible next steps to inspect the results.

1. We can plot the network with Python (or better: with networkx)

Python has tons of plotting options, but it is usually not my first go-to if it comes to interactive plotting. It can still be useful to start exploring your network to get a first sense of how things look and to check if things work the way they should. In matchmsextras.networkingwe provide a very basic network plotting function: plot_cluster()which can simply be run like this:

Networkx plot of the generated network… not particularly helpful, right?

Lacking any interactivity this kind of plot is not the most helpful. But it does give a first sense of how our network looks like. We see several clusters (or subnetworks) which are not connected to each other, and plenty of smaller clusters or individual, not-connected nodes (singletons) on the periphery. But the plot itself is too crowded to see what’s going on in those clusters.
So we could move on to:

2. Plot individual clusters

The plot above is not very revealing because it contains too many nodes, edges and labels. So let’s shift the focus to the more interesting stuff. First, let’s split the network into subnetworks by which I here mean connected areas.

Most of the time nodes with no (or very few) connections are less interesting, so let’s kick them out and only list the larger clusters.

This will output a dictionary that starts like this:

The first number is the index in subnetworksand the second number is the number of nodes in that subnetwork. Let’s pick one of those clusters, say number 19which contains 32 spectra.

One of the clusters in the netowork (32 spectra). Edge line thickness corresponds to Spec2Vec similarity (thicker lines mean more similar).

That is more insightful. Yet, to interactively explore the graph it is generally better to go to

3. Export the graph and plot elsewhere (e.g. Cytoscape)

This comprises two steps. First, we will export the graph/network as a graphml file which contains all edges and nodes. Second, we will also export the metadata which can later be used to color or label the graph.

And once the graph is exported, we would also like to export the metadata:

Using a program such as Cytoscape now allows to plot this network and interactively explore it as much as you like…

Example created using Cytoscape. The nodes are colored based on parent_mass. The edge thickness reflects the Spec2Vec similarity score from 0.7 (thin) to 1.0 (thick).

Other parts of this tutorial

part I — Build you own mass spectrometry pipeline

part II — Compute spectra similarities using Spec2Vec

part III — Network analysis based on spectra similarities

matchms library

matchms is freely available and open-source. You can find the code on Github, the package is available as conda package (recommended) or from pypi. It was developed by a bunch of very nice people (see here) and is published in Joss.

spec2vec library

Spec2Vec is of course also freely available and open-source. You can find the code on Github, the package is available as conda package (recommended) or from pypi. The article published in PLOS Computational Biology can be found here.

You can find more information in the matchms documentation and the spec2vec documentation.

Code

The code shown in this blog post can also be found as Jupyter notebook on GitHub.

Help us develop matchms and spec2vec further!

We really hope that matchms will help people to build new, better analysis pipelines and that spec2vec can improve your analysis results. So, please help us to further improve matchms and spec2vec. You can help us by (1) Cite our work if you use our package for your research: matchms article in JOSS, spec2vec article in PLOS Computational Biology.
(2) Let us know if you run into any issues using matchms and/or spec2vec, or if you are missing any key functionalities! You can simply do so by starting a new issue on the matchms GitHub repository, or contact me on twitter @me_datapoint.
You are of course more than welcome to also become an active contributor to the package and help extend its functionality!
(3) Please let us know if you successfully applied matchms and/or spec2vec in your research project (makes us happy + we can maybe share the good news!).

Thanks a lot!

Published in Netherlands eScience Center

We’re an independent foundation with 80+ passionate people working together in the Netherlands’ national centre for academic research software.

Written by Florian Huber

Professor for Data Science at University of Applied Sciences Düsseldorf | research software engineer | former biological physicist | former chocolatier |

Responses (1)

What are your thoughts?