A novel sequence-based negative sampling approach for improving protein-protein interactions prediction using machine learning techniques | [Journal of Theoretical and Applied Information Technology • 2022]

Author(s):

1. M. SAYED BARKAT: Department of Information Systems, Faculty of Computer and Information Sciences, Ain Shams University, Cairo 11566, Egypt

2. SHERIN M. MOUSSA: Department of Information Systems, Faculty of Computer and Information Sciences, Ain Shams University, Cairo 11566, Egypt

3. NAGWA L. BADR: Department of Information Systems, Faculty of Computer and Information Sciences, Ain Shams University, Cairo 11566, Egypt

Abstract:

Protein-protein interactions (PPIs) are involved in the progression of numerous diseases and are therefore important in drug discovery. Although PPI prediction is a crucial and well-studied task in bioinformatics, the interactions of several proteins have still not been thoroughly investigated. The cost of understanding PPIs and identifying protein-protein noninteractions (PPNIs) using sequence alignment makes the current computational methods inefficient, so identifying PPNIs without applying sequence alignment has become a necessity. In this research, a machine learning approach is proposed for PPI prediction based on protein sequence information, in which we introduce “Features-based Negative Generation”, a novel approach for identifying PPNI samples. This method measures the similarity of sequence features without alignment, which keeps the computational cost affordable. After PPNI identification, the Conjoint Triad (COT) and Epitopes methods are used for feature extraction, and the results of both are compared to achieve higher accuracy with less time consumption. Five machine learning techniques were investigated to learn PPI features from the sequences of interacting pairs: support vector machine (SVM) with polynomial and RBF kernel functions, Linear SVM, Tree Model (TM), and Linear Model. TM achieved the best result, with an accuracy of 97.8%. PPI prediction using the generated negative dataset and COT with 343 features achieved an accuracy of 97.8%, versus 93% using a random negative dataset, also with COT. Applying Epitopes with 21 features to our PPNI dataset achieved an accuracy of 94.5%, versus 92.5% with a random negative dataset, which indicates that the identified PPNI datasets are cleaner and less noisy and that PPI prediction using the identified PPNIs is more accurate. We compared the PPI prediction accuracy obtained using PPNIs extracted by our method with that obtained by other methods in the literature and found an improvement in our favor of between 2% and 7%. Feature extraction with Epitopes is faster than with COT by an average of 83%.
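
The 343 COT features mentioned in the abstract correspond to the 7 x 7 x 7 possible triads of amino acid classes. The following is a minimal sketch of that counting step, not the authors' implementation. It assumes the seven-class amino acid grouping commonly used with the conjoint triad method; the grouping, the function name `conjoint_triad`, and the choice to return raw counts without normalization are all assumptions made for illustration.

```python
from itertools import product

# ASSUMPTION: seven-class amino acid grouping commonly used with the
# conjoint triad method; the abstract does not state the grouping.
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: idx for idx, group in enumerate(CLASSES) for aa in group}

# Map every ordered triple of class indices to a position in the vector.
TRIAD_INDEX = {
    triad: pos for pos, triad in enumerate(product(range(7), repeat=3))
}


def conjoint_triad(sequence):
    """Return a 343-element list of raw triad counts for one sequence.

    Hypothetical helper for illustration. Triads that contain a residue
    outside the 20 standard amino acids are skipped.
    """
    counts = [0] * len(TRIAD_INDEX)
    classes = [AA_TO_CLASS.get(aa) for aa in sequence.upper()]
    # Slide a window of three consecutive residues along the sequence.
    for i in range(len(classes) - 2):
        triad = tuple(classes[i:i + 3])
        if None in triad:
            continue
        counts[TRIAD_INDEX[triad]] += 1
    return counts


if __name__ == "__main__":
    features = conjoint_triad("MKTAYIAKQR")
    print(len(features), sum(features))
```

A sequence of length n with only standard residues yields n - 2 counted triads. Any normalization of the counts, and the way the vectors of the two proteins in a pair are combined, are left out here because the abstract does not specify them.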

Page(s): 5070-5095

DOI: not available

Published: Journal: Journal of Theoretical and Applied Information Technology, Volume: 100, Issue: 16, Year: 2022

Keywords:

Machine Learning, Drug Discovery, Protein-Protein Interaction, Epitopes, PPNIs Sampling, Biological Pathways, Protein-Protein Negative Interactions, Conjoint Triad Method
