Meet's Thoughts
Protein ML Research
Thoughts about proteins as of Nov. 22, 2022
I'm working on a project using large-scale protein interaction data for protein design. I'm also thinking about collective intelligence in deep learning. Models like neural cellular automata are interesting: the system's behavior captures some compositionality and is robust to perturbations. A happy face needs eyes and a mouth; you can mess them up and end up with a long mouth and one eye, or with two faces competing for space. Why not have two magnesium-binding regions on a protein?
Houston, we got prions
Previous thoughts about proteins
Thoughts about proteins as of Dec. 8th, 2021
I'm working on a project to predict protein function descriptions in natural language, focusing on evaluating those descriptions in an automated way. I'm trying to choose a good metric, starting my search with BLEU and other measures used in machine translation, along with the measures mentioned in this paper. If you know a lot about such measures, contact me! Compared to machine translation, this problem makes different assumptions about what constitutes a good match for a pair of descriptions, and about how to score a set of predicted descriptions against the set of functions for a particular protein.
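As a toy illustration of why surface n-gram overlap may be a poor fit here, this is a minimal sketch of modified n-gram precision, the core ingredient of BLEU (the tokenization and example descriptions are my own, not from any real dataset):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n=1):
    """Clipped n-gram precision: candidate n-gram counts are capped
    by their counts in the reference, then divided by the candidate total."""
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    clipped = sum(min(count, ref_counts[g]) for g, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

ref = "catalyzes the hydrolysis of ATP".split()
# a paraphrase with the same functional meaning but little word overlap
para = "breaks down ATP by hydrolysis".split()
```

A paraphrase that preserves the functional meaning shares few exact n-grams with the reference, so it scores low; that mismatch between surface overlap and functional equivalence is one way the assumptions here differ from machine translation.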
Prior thoughts
I've been working on function/fold/class discovery for proteins recently.
I'm thinking about neural network-based clustering algorithms, though I know there are possibly much better ways to approach class discovery for proteins (probabilistic programming, energy-based models).
I want to learn more about those better approaches, but I still think it's worth exploring adapting the new techniques developed for unsupervised image classification for proteins just to see how they'd do.
Work surrounding mine
Some related work that I think is interesting.
- Vilnis, Luke, and Andrew McCallum. "Word representations via gaussian
embedding." arXiv preprint arXiv:1412.6623 (2014).
- Mikolov, Tomas, et al. "Efficient estimation of word representations in
vector space." arXiv preprint arXiv:1301.3781 (2013).
- Grover, Aditya, and Jure Leskovec. "node2vec: Scalable feature learning for
networks." Proceedings of the 22nd ACM SIGKDD international conference on
Knowledge discovery and data mining. 2016.
- Radford, Alec, et al. "Learning transferable visual models from natural
language supervision." arXiv preprint arXiv:2103.00020 (2021).
- Van Gansbeke, Wouter, et al. "Learning to classify images without labels."
arXiv preprint arXiv:2005.12320 (2020).
- Caron, Mathilde, et al. "Deep clustering for unsupervised learning of visual
features." Proceedings of the European Conference on Computer Vision (ECCV).
2018.
- Singh, Rohit, Jinbo Xu, and Bonnie Berger. "Global alignment of multiple protein interaction networks with application to functional orthology detection." Proceedings of the National Academy of Sciences Sep 2008, 105 (35) 12763-12768; DOI: 10.1073/pnas.0806627105
- Ashburner, Michael, et al. "Gene ontology: tool for the unification of
biology." Nature genetics 25.1 (2000): 25-29.
- Zhou, N., Jiang, Y., Bergquist, T.R. et al. The CAFA challenge reports
improved protein function prediction and new functional annotations for
hundreds of genes through experimental screens. Genome Biol 20, 244 (2019).
https://doi.org/10.1186/s13059-019-1835-8
IsoRank (Singh et al. 2008)
IsoRank is a global network alignment algorithm that can be used to detect functionally similar proteins between two interaction networks. It involves two main steps:
- Solving an eigenvalue equation to compute a functional similarity score matrix R between all pairs of proteins across the two networks
- Extracting a set of high-scoring and mutually consistent matches from the R matrix.
We used the first scoring step in NetQuilt, because the similarity profiles that IsoRank computes are pretty informative features for function prediction across multiple species.
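The scoring step above can be sketched as a power iteration on the recurrence R ← αĀRB̄ᵀ + (1 − α)E, where Ā and B̄ are column-normalized adjacency matrices and E holds prior (e.g. sequence-based) similarities. This is a minimal sketch, not NetQuilt's or IsoRank's exact implementation; the normalization details and damping parameter α are illustrative:

```python
import numpy as np

def isorank_scores(A, B, E, alpha=0.8, n_iter=200, tol=1e-9):
    """IsoRank-style similarity scores between the nodes of two networks
    with adjacency matrices A (n1 x n1) and B (n2 x n2), seeded with
    prior node-pair similarities E (n1 x n2)."""
    # column-normalize so each node splits its weight among its neighbors
    An = A / np.maximum(A.sum(axis=0, keepdims=True), 1e-12)
    Bn = B / np.maximum(B.sum(axis=0, keepdims=True), 1e-12)
    E = E / E.sum()
    R = E.copy()
    for _ in range(n_iter):
        # R(i,j) <- alpha * (support from neighbor pairs) + (1-alpha) * prior
        R_new = alpha * An @ R @ Bn.T + (1 - alpha) * E
        R_new /= R_new.sum()  # keep R a distribution over node pairs
        if np.abs(R_new - R).max() < tol:
            return R_new
        R = R_new
    return R
```

Each entry R(i, j) ends up high when i and j have neighbors that are themselves good matches, which is the recursive intuition behind using the rows of R as cross-species similarity profiles.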
SCAN (Semantic clustering by Adopting Nearest neighbors) (Van Gansbeke et al. 2020)
SCAN is a neural network-based clustering algorithm that has been used to classify images in an unsupervised way. This one involves three main steps:
- Learn image features with a self-supervised pretext task, then build a k-nearest-neighbors graph in the learned feature space
- Train a softmax layer on top of the previous model's embeddings with a semantic clustering loss, which encourages neighboring samples in the k-nearest-neighbors graph to fall in the same class
- Fine-tune with pseudo-labels extracted from the model's confident predictions on strongly-augmented images, refining the clusters and increasing prediction confidence
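The semantic clustering loss in the second step combines a consistency term (neighbors should get the same cluster assignment) with an entropy term that spreads assignments across clusters to prevent collapse. A minimal NumPy sketch, with an illustrative entropy weight:

```python
import numpy as np

def softmax(z, axis=1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scan_loss(anchor_logits, neighbor_logits, entropy_weight=5.0):
    """SCAN-style clustering loss over a batch of (anchor, neighbor) pairs."""
    p_a = softmax(anchor_logits)
    p_n = softmax(neighbor_logits)
    # consistency: dot product of assignment distributions should be ~1
    consistency = -np.log((p_a * p_n).sum(axis=1) + 1e-8).mean()
    # entropy of the mean assignment; maximizing it avoids one-cluster collapse
    p_mean = p_a.mean(axis=0)
    entropy = -(p_mean * np.log(p_mean + 1e-8)).sum()
    return consistency - entropy_weight * entropy
```

In the paper the pairs come from the k-nearest-neighbors graph built in step one; here the logits are just placeholders for the softmax head's outputs.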
I'm currently exploring how this could work with protein sequences, since, like most other biological data, most sequence data is unlabeled and needs to be categorized.