PREreview of AlphaFold Meets Flow Matching for Generating Protein Ensembles

by Ashraya Ravikumar, Stephanie Wankowicz, Flip Jansen, and 1 other author

Published: April 25, 2024
DOI: 10.5281/zenodo.11062553
License: CC BY 4.0

AlphaFold Meets Flow Matching for Generating Protein Ensembles

Summary

Major progress has been made using ML methods to predict single static structures of proteins from their primary sequence (e.g. using AlphaFold or ESMFold). However, because proteins exist as ensembles of conformations, there is a need for methods that can predict these ensembles given the sequence or a single structure. Although some computational methods are available for this task, in this paper the authors take a new approach by converting AlphaFold/ESMFold, which are regression models, into generative models using flow matching. Flow matching is a generalization of diffusion modeling and, in this case, works by iteratively sampling from a harmonic prior and interpolating the sample with a data point to create a noisy input template that is given to AlphaFold/ESMFold along with the sequence, which is held constant. The authors compare their method to an increasingly popular method for generating diversity from AlphaFold predictions: MSA subsampling. They evaluate the ensemble outputs from AlphaFlow against pseudo-ensembles of previously solved structures deposited in the PDB and against ensembles of structures generated using MD. They use a combination of informative metrics, such as the diversity, recall, and precision of the AlphaFlow and MSA subsampling ensembles relative to the ground-truth ensembles, and show that their method is better than MSA subsampling on the diversity/recall vs. precision trade-off frontiers. Unlike existing diffusion methods, flow matching is also able to handle missing residues in the input structure and is invariant to structure size. As mentioned by the authors, the limitations of their approach are that it cannot yet be used to generate temporally ordered ensembles, it does not aim to sample the whole Boltzmann distribution of the protein of interest, and it cannot be used for kinetic studies. There is also conformational diversity that the method is not able to recapitulate.
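
To make the noisy-input construction described above concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the harmonic prior is a Gaussian over chain coordinates with density proportional to exp(-k/2 * sum_i ||x_i - x_{i+1}||^2) and that the noisy input is the linear interpolant (1 - t) * x0 + t * x1. The function names, the spring constant, and the random stand-in for a data structure are hypothetical, and the RMSD alignment step is omitted.

```python
import numpy as np


def sample_harmonic_prior(n_residues, rng, spring_constant=1.0):
    """Sample (n_residues, 3) coordinates from an assumed chain-like Gaussian prior.

    Assumed density: p(x) ∝ exp(-k/2 * sum_i ||x_i - x_{i+1}||^2).
    """
    # Graph Laplacian of a linear chain: neighbouring residues are joined by springs.
    lap = np.zeros((n_residues, n_residues))
    idx = np.arange(n_residues - 1)
    lap[idx, idx] += 1.0
    lap[idx + 1, idx + 1] += 1.0
    lap[idx, idx + 1] -= 1.0
    lap[idx + 1, idx] -= 1.0

    # eigh returns eigenvalues in ascending order; the first one is ~0 and
    # corresponds to rigid translation, so it is dropped.
    eigvals, eigvecs = np.linalg.eigh(lap)
    scales = 1.0 / np.sqrt(spring_constant * eigvals[1:])

    # Independent Gaussian coefficients along each non-zero mode, per x/y/z axis.
    coeffs = rng.standard_normal((n_residues - 1, 3)) * scales[:, None]
    return eigvecs[:, 1:] @ coeffs


def noisy_interpolant(x0, x1, t):
    """Linear interpolation between a prior sample x0 (t=0) and a data point x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1


rng = np.random.default_rng(0)
x1 = rng.standard_normal((16, 3))    # hypothetical stand-in for data coordinates
x0 = sample_harmonic_prior(16, rng)  # sample from the assumed prior
t = rng.uniform()                    # random interpolation time in [0, 1)
x_t = noisy_interpolant(x0, x1, t)   # noisy template that would be given to the model
```

In this sketch the prior sample is centered at the origin by construction, because the translational mode is removed before sampling.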

We find this to be an important paper with very interesting results that are well presented and well written. The key limitations we see in the current study are the lack of a demonstration that the method can generate biologically relevant conformations as part of the ensembles and the lack of a comparison to existing generative diffusion-like methods. We expand on these points and raise our other questions below:

Major points

Out of the 563 proteins that satisfy all the criteria for the test set, the authors say they subsampled 100 structures to form the test set. It is unclear why they chose to do this. Did they do any bootstrapping on this subsample? Also, the chain length range chosen by the authors (265-765) seems strange given that a large number of proteins are shorter than the lower limit.
The authors fixed the number of flow-matching steps at 10. However, it would be useful and interesting to see how the diversity/recall vs. precision trade-off depends on the number of flow-matching steps.
The authors mention that they perform RMSD alignment as one of the tweaks that make this method work in the quotient space (in section A2, before equation 15). However, the motivation behind this decision is neither obvious nor made clear to the reader.
Although the authors have shown improvement at the ensemble level over MSA subsampling, some MSA subsampling methods have been able to sample biologically important conformations such as the open and closed states in kinases (for example, as shown in https://www.biorxiv.org/content/10.1101/2023.07.25.550545v3). Have the authors had success in generating such biologically relevant conformational transitions? Also, given the low level of aggregate recall, is the heterogeneity of the AlphaFlow-generated ensembles representative of biologically relevant heterogeneity that is not present in the PDB or MD simulations? Or, the other way around, is AlphaFlow failing to generate the biologically relevant type of heterogeneity (e.g. not around equilibrium) that is present in the PDB?
We understand that, in order to generalize the results of the comparison between AlphaFlow ensembles and MD ensembles, the authors have resorted to using the ATLAS database, since it has trajectories of a representative set of proteins. However, as an extension of the previous point, to truly evaluate AlphaFlow’s ability to sample biologically relevant conformations, it would be interesting to compare AlphaFlow outputs to some long MD simulations, as shown for example by Riccabona et al, 2024. With long-duration simulations, it is possible to plot a free energy landscape of the protein, project the AlphaFlow-predicted structures onto the landscape, and directly visualize the extent of sampling achieved by the method.
We were wondering why the stated training cutoff of May 1, 2018 appears to coincide with the AlphaFold1 training cutoff, which suggests the use of AlphaFold1, even though the authors mention that OpenFold (Ahdritz et al, 2022) is used for fine-tuning; OpenFold is modeled to be analogous to AlphaFold2, with its corresponding 2020 cutoff date. We presume that AlphaFlow is in fact trained by fine-tuning the AlphaFold2 model and that the early cutoff date was chosen to offer a larger pool of potential test proteins. Is that correct?
Besides MSA subsampling, the authors have not compared their method to other (iterative denoising) models used for structure prediction and ensemble generation, e.g. comparing performance to Distributional Graphormer (Zheng et al, 2023) or EigenFold (Jing et al, 2023) for a wider range of protein sizes.
Ahdritz et al, 2022 show with OpenFold that AlphaFold - colloquially - greedily reduces FAPE loss during training by forming a rudimentary PCA-like representation of the input structure and solving the structure for that in an iterative dimension-increasing fashion. How does using a FAPE^2 loss influence this behavior? We would expect this kind of greedy PCA representation behavior to increase.
We would also be interested in the diversity inherent in the PDB set and how that compares to the diversity found in the compared methods.
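
On the bootstrapping question raised in the first major point, the following is a minimal sketch of the kind of analysis we have in mind. It assumes that a single per-protein metric value (for example, a per-target RMSD-based score) is available for each of the 100 test proteins. The function name and the placeholder values are hypothetical and are not results from the paper.

```python
import numpy as np


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-protein values."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)

    # Resample proteins with replacement; each row is one bootstrap replicate.
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    replicate_means = values[idx].mean(axis=1)

    lower, upper = np.quantile(replicate_means, [alpha / 2, 1 - alpha / 2])
    return values.mean(), lower, upper


# Placeholder per-protein metric values (synthetic, for illustration only).
placeholder_values = np.random.default_rng(1).uniform(0.0, 1.0, size=100)
mean, lower, upper = bootstrap_mean_ci(placeholder_values)
```

Reporting such an interval for each metric would indicate how sensitive the conclusions are to the particular 100 proteins drawn from the 563 candidates.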

Minor points

The current method takes only the positions of the Carbon-beta (Cβ) atoms as input, and consequently, the optimization process is performed solely over these positions. However, it is important to note that the selection of the best-fitting overall structure is not limited to Cβ information alone. The method incorporates whole structure information, including the positions of Cβ and other residue atoms, when calculating the RMSD-aligned loss function. This allows for a more comprehensive evaluation of the predicted structure's quality but raises a question about the impact of incorporating additional information beyond Cβ positions. It remains unclear how the inclusion of more detailed structural data, such as the positions of other backbone or side-chain atoms, could influence the prediction outcomes.
Is it possible to subsample the MSA to depths that are not powers of 2? We ask this because, for certain metrics shown in Table 1, the trends in MSA subsampling suggest that choosing an intermediate depth might make the method match the performance of AlphaFlow. For instance, if the MSA were subsampled at 48 instead of 32 or 64, the pairwise RMSD of the MSA subsampling ensemble might match that of the MD ensemble.
To us, it is clear why the authors would use the Frame Aligned Point Error loss. But we are unsure why the loss is squared. We think that, unlike with a single-structure-predicting AlphaFold, we are now looking at ensembles and expect the predictions to be quite close to each other. Therefore one would want to penalize relatively small differences more severely than usual. It would be good to know the authors’ thought process behind this choice.
The schematic shown in Figure 1 is quite helpful. However, we would like to know if there is any reason behind the authors’ choice to show a symmetric bifurcation in the flow field. Is this indicative of some underlying “sub-classes” of structures in the predicted ensembles?
Is there any roadmap for adapting AlphaFlow so that it could be used to study the elusive class of membrane proteins?

To us, it is unclear how the step size/number of steps was chosen at inference time.
“When necessary, we subsample or replicate by the appropriate power of 2 to ensure all analyses operate on 256 frames (important for finite-sample Wasserstein distances).” What do the authors mean by replicate? Is it a duplication of the output structures or something else?
The distillation procedure is described poorly in the appendix. What are X and Y?
To us, it is unclear why, in panel “PC > 0.5” of Figure 4, all the metrics decrease at the final increment of allotted GPU time.
The word “the” is repeated in Section 3.2, below equation (5), in the sentence “..discussed previously to be immediately used as the the denoising model xˆ1(x, t; θ), with x as the noisy input and t as an additional time embedding.”
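
Regarding the question above about what “replicate” means: one possible reading is plain duplication of frames. The sketch below illustrates that reading only; it is our assumption and not a confirmed description of the authors' procedure. The function name is hypothetical, and the strided subsampling for larger ensembles is likewise an assumed choice.

```python
import numpy as np


def resize_ensemble(frames, target=256):
    """Bring an ensemble to `target` frames by striding or by duplicating frames.

    Assumes the frame count and `target` differ by a power-of-2 factor.
    """
    frames = np.asarray(frames)
    n = len(frames)
    if n == 0:
        raise ValueError("ensemble is empty")

    big, small = max(n, target), min(n, target)
    ratio = big // small
    # ratio & (ratio - 1) == 0 holds exactly when ratio is a power of 2.
    if big % small != 0 or ratio & (ratio - 1) != 0:
        raise ValueError("frame count and target must differ by a power of 2")

    if n >= target:
        # Subsample: keep every `ratio`-th frame.
        return frames[::ratio]
    # Replicate: duplicate each frame `ratio` times (assumed meaning of "replicate").
    return np.repeat(frames, ratio, axis=0)


small_ensemble = np.zeros((64, 10, 3))    # 64 frames of 10 atoms (dummy coordinates)
large_ensemble = np.zeros((1024, 10, 3))  # 1024 frames of 10 atoms (dummy coordinates)
replicated = resize_ensemble(small_ensemble)   # each frame appears 4 times
subsampled = resize_ensemble(large_ensemble)   # every 4th frame is kept
```

If replication does mean duplication, the duplicated frames add no new information, so it would help to know how this affects the finite-sample Wasserstein distances.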

Reviewed by

Flip Jansen, Ashraya Ravikumar, Stephanie Wankowicz, James Fraser

UCSF

Competing interests

The authors declare that they have no competing interests.