PREreview of Deep indel mutagenesis reveals the impact of amino acid insertions and deletions on protein stability and function

by Priyanka Bajaj, Karson Chrispens, and 2 other authors

Published: May 7, 2024
DOI: 10.5281/zenodo.11131001
License: CC BY 4.0

Summary:

Deep mutational scanning has revealed the impact of individual point mutants on protein function and improved computational predictors of mutational effects. However, many mutations observed in evolution and disease are the result of insertions or deletions (indels), and the impact of these mutations is poorly predicted computationally relative to missense mutations. While a few other papers have profiled indels in a systematic way, the major contributions of this paper are: 1) profiling indels across many different proteins (9), 2) profiling indels for both stability and function, and 3) introducing the concept of high-throughput profiling of DelSub mutations (mutations that remove an amino acid and substitute an amino acid in a single event).

The authors make a library of substitutions, insertions and deletions of 9 protein domains. They carry out deep sequencing coupled to a fragment complementation assay (aPCA) that correlates well with expression/stability. They then go deep on two peptide binding domains, comparing functional mutational scans as well. This provides a rich set of data to compare to computational predictions, where notable limitations in current prediction methods are identified. The major limitation of the paper is in the presentation: there is a lot of data and the figures are quite complex, but the text is brief and difficult to follow in parts. Expanding the text and splitting the figures into a larger number of simpler figures would likely improve the ability to extract insights from these impressive datasets. Additionally, further discussion of the results within the context of past literature would be helpful in guiding interpretation of the study.

Major points:

- The authors included 9 domains that span classical motifs in their indel scan. From our reading, it is unclear what rationale the authors used to select these domains. What makes these a diverse set of domains (a/b content? size? eukaryotic/prokaryotic origin? other topological features? etc.)? Stating the rationale will help the reader understand how to generalise the results.

- In the section “Evaluating indel variant effect prediction”, the authors could comment on why PROVEAN is better suited to predicting insertions and deletions than substitutions. Conversely, why does CADD predict substitutions better than PROVEAN? What design choices distinguish PROVEAN from CADD? Is there a way to train a model that performs well on both?

- In the section “Structural determinants of indel tolerance,” the authors mention multiple features that seem potentially important for the effects of insertions and deletions. Currently, specific patterns are discussed for the substitutions, but the main draw of this manuscript is that it contains indel mutagenesis across many proteins. The discussion of indels in this section does not go beyond stating that indels are more tolerated at the N and C termini, that the secondary structure is important, and that the position of the indel seems to have an impact. In our reading these descriptions are vague; would it be possible to describe general trends? What specific lengths are tolerated vs not? Which secondary structural elements are more sensitive? It would be helpful to relate these trends to existing literature. For example, previous work in a potassium channel, Kir2.1 (Macdonald, CM, Genome Biology 2023), found that deletions were more disruptive than insertions, especially in beta sheets. Is that also seen here?

- The authors train a model, as described in the ‘Accurate indel variant effect prediction’ section, to predict the effects of indels within all the proteins contained in this screen and in another manuscript (Tsuboyama et al.). However, there are other previous indel scans, including one within a viral AAV capsid protein (<https://www.science.org/doi/10.1126/science.aaw2900>), a potassium channel, Kir2.1 (<https://genomebiology.biomedcentral.../articles/10.1186/s13059-023-02880-6#citeas>), and an amyloid protein that involved the senior author (<https://pubmed.ncbi.nlm.nih.gov/36400770/>). It may be useful to test the performance of the model on these datasets, which were not generated within the same study, and to discuss where the model performs well and where it does not.

- The authors find that insertions can generate gain-of-function molecular phenotypes at higher rates relative to deletions and substitutions. Overall, the manuscript presents the results of deep indel mutagenesis on several protein domains, but lacks thorough discussion of the results. A discussion addressing the possibility of non-native ligand binding following indel mutations would provide an evolutionary perspective that contextualises this research.

- The authors mention that some gain-of-function mutations occur due to short insertion mutations. While some domain insertions have been shown to have stabilising effects (<https://doi.org/10.1371/journal.pcbi.1006008>, <https://doi.org/10.1038/s41467-018-08171-0>, <https://www.nature.com/articles/s41467-021-27342-0>), the findings presented here are novel in that they involve short insertions. Citing prior work on beneficial domain insertions and deletions of varying types would nevertheless be useful. Emphasising this in the “Insertions generate gain-of-function molecular phenotypes” section and considering how the insertions in the PSD95-PDZ3 domain might increase stability would enhance the understanding of the underlying mechanisms driving these gain-of-function phenotypes.

- Related to gain-of-function: The last sentence of the discussion section references the potential usefulness of indel mutagenesis for protein engineering. As the paper notes, the results here will impact the protein engineering field significantly, so further discussion here will help extend the reach of this paper. Discussing which protein engineering strategies would be enhanced (e.g. directed evolution) would help the reader in evaluating the impact of the data presented. There are a few related papers in the literature (e.g. https://doi.org/10.1073/pna...) that could support this.

Minor points:

- Several of the figures are very information-dense. This manuscript would benefit from breaking up these figures and reorganizing them to make them easier to understand. It may be helpful also to focus the figures on the main points the authors would like to make within the manuscript as currently the sheer amount of data and analyses makes it difficult to follow the narrative. Additionally, figures could benefit from having secondary structure representations above or below heatmaps to aid in interpretation.

- Figure 1 contains sections labeled A-C but presents 9 comprehensive experiments with 6 subpanels each. It is very difficult to evaluate the data when it is so densely represented. Aligning an “unfolded” secondary structure below the heatmap for all domains might be clearer than colouring by secondary structure. We find that secondary structure coloring obscures the patterns.

- Figure 2 would benefit from significant reorganization, perhaps around the SH3 and PDZ domains specifically. The spacing within this figure makes it difficult to follow the large number of comparisons it contains. Perhaps it would be worth splitting this figure up or moving some of the comparisons to the supplement. As in Figure 1, secondary structure above or below the heatmaps would be helpful.

- In Figure 5E, data is shown for GRB2-SH3 in which some mutants show high binding and low abundance. Additionally, these mutants are most prevalent in the core and surface. What could explain such a phenotype for core mutants, given that they are expected to have the most destabilizing effect? For the surface mutants, are those residues close to the binding site? Additionally, the distributions of these two plots look fundamentally different (with a much higher correlation between binding and abundance for PDZ and a distinct change in the pattern of binding residues being gain or loss of function across the two domains). What does that say about the baseline stability of GRB2 vs PSD95? Or is this more representative of some methodological aspect of the dynamic range and sensitivity of the respective assays when run on these specific proteins?

- In the heatmap figures the coloring is confusing. Currently they are colored from red to white to blue, but in the corresponding color bars next to the heatmaps white is not centered at 0. Presumably 0 is wildtype fitness, and blue and red are greater and less than wildtype fitness, respectively. This should be explicitly stated within all figure legends. White should correspond to 0; otherwise it is difficult to determine the effect of a mutation.

- Figure 3E contains violin plots, but these lack an inner boxplot indicating the median, the interquartile range, and whiskers that span the non-outlier range. Adding one would be useful for comparing across these variant types.

- In the ‘Materials and Methods’ section, we were confused by the filtering steps described under ‘sequence data processing’. It would be helpful to state the minimum read cut-off.

- Figure 4B should be explained more completely in the figure legend. It is very difficult to make out what the difference in colour for each panel means.

- In the last paragraph of the introduction, it would provide additional support and context if you referenced other work with similar findings. These papers (<https://doi.org/10.1186/s13059-023-02880-6>, <https://doi.org/10.1101/2023.06.06.543963>) would support the statement, “In general, indels are better tolerated in protein termini than in secondary elements.”
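
To illustrate the heatmap colouring point above, here is a minimal sketch of one way to keep white pinned to 0 when the score range is asymmetric. It is not the authors' plotting code: the function names, the score range, and the linear red-white-blue ramp are all our own assumptions for illustration. The idea is to normalise each side of 0 separately, which is the same behaviour that matplotlib provides through `matplotlib.colors.TwoSlopeNorm`.

```python
def centered_norm(value, vmin, vmax, vcenter=0.0):
    """Map value to [0, 1] so that vcenter always lands on 0.5."""
    if not vmin < vcenter < vmax:
        raise ValueError("expected vmin < vcenter < vmax")
    value = min(max(value, vmin), vmax)  # clip to the colour-bar range
    if value < vcenter:
        # Scores below the centre fill the lower half of the scale.
        return 0.5 * (value - vmin) / (vcenter - vmin)
    # Scores at or above the centre fill the upper half of the scale.
    return 0.5 + 0.5 * (value - vcenter) / (vmax - vcenter)


def red_white_blue(t):
    """Linear red -> white -> blue ramp for t in [0, 1]; returns (r, g, b)."""
    if t < 0.5:
        s = t / 0.5  # 0 at pure red, 1 at white
        return (1.0, s, s)
    s = (1.0 - t) / 0.5  # 1 at white, 0 at pure blue
    return (s, s, 1.0)


# Hypothetical asymmetric score range (assumed values, not from the manuscript).
VMIN, VMAX = -1.5, 0.5

for score in (VMIN, 0.0, VMAX):
    print(score, red_white_blue(centered_norm(score, VMIN, VMAX)))
```

With this normalisation a score of 0 (presumably wildtype) is always white, regardless of how far the scale extends on either side, so the sign of a mutation's effect can be read directly from its hue.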

Reviewed by Priyanka Bajaj, Karson Chrispens, James Fraser, Willow Coyote-Maestas (UCSF)

Competing interests

The authors declare that they have no competing interests.