- Eskişehir Technical University Journal of Science and Technology A - Applied Sciences Engineering
- Volume:23 Issue:1
COMPARATIVE OF SUCCESS OF KNN WITH NEW PROPOSED K-SPLIT METHOD AND STRATIFIED CROSS VALIDATION ON REMOTE HOMOLOGUE PROTEIN DETECTION
Authors : Fahriye GEMCİ, Turgay İBRİKÇİ, Ulus ÇEVİK
Pages : 87-108
Doi:10.18038/estubtda.970169
Publication Date : 2022-03-30
Article Type : Research Paper
Abstract : This study addresses remote homologous protein detection, a bioinformatics problem whose solutions have contributed substantially to medicine. Protein sequences taken from SCOP, an important and widely used protein database, were used to test remote homologue detection. Feature vectors were obtained from the protein sequences using the bag-of-words model and classified with the k-nearest neighbor (kNN) algorithm. The classifier was evaluated with the following distance measures: Bray-Curtis, Euclidean, Minkowski, Dice, Jaccard, Chebyshev, cosine, Sokal-Sneath, correlation, matching coefficient, Rogers-Tanimoto, Sokal-Michener, Canberra, Hamming, Kulczynski, and Russell-Rao. Two new methods are proposed to address the imbalanced data problem: cross validation with a special k-fold value, and a novel k-split method. The kNN algorithm with the Bray-Curtis distance performed best, reaching 98.9% accuracy and a 77% ROC score under cross validation with the special k-fold value, and 83.8% accuracy and a 92% ROC score with the novel k-split method.
Keywords : Remote Homologue Protein, k nearest Neighbor, Bag of words model, Distances, k fold Stratified Cross Validation
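The pipeline described in the abstract (bag-of-words features, then kNN with the Bray-Curtis distance) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the "words" are overlapping k-mers of length 2, uses made-up toy sequences and labels rather than SCOP data, and does not reproduce the special k-fold value or the k-split method. The function names `kmer_counts`, `bray_curtis`, and `knn_predict` are hypothetical.

```python
from collections import Counter


def kmer_counts(sequence, k=2):
    """Bag-of-words vector: counts of overlapping k-mers (k=2 is an assumption)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative count vectors."""
    keys = set(u) | set(v)
    numerator = sum(abs(u[key] - v[key]) for key in keys)
    denominator = sum(u[key] + v[key] for key in keys)
    return numerator / denominator if denominator else 0.0


def knn_predict(train, query, n_neighbors=3):
    """Majority vote among the n_neighbors training vectors closest to query."""
    nearest = sorted(train, key=lambda item: bray_curtis(item[0], query))
    votes = Counter(label for _, label in nearest[:n_neighbors])
    return votes.most_common(1)[0][0]


# Toy, made-up sequences and labels for illustration only (not SCOP data).
train = [
    (kmer_counts("AAAKAAAK"), "family_a"),
    (kmer_counts("AAKKAAAK"), "family_a"),
    (kmer_counts("GGSGGSGG"), "family_b"),
    (kmer_counts("GSGGSGGS"), "family_b"),
]
predicted = knn_predict(train, kmer_counts("AAAKAAKA"), n_neighbors=3)
print(predicted)
```

Bray-Curtis compares count vectors by the share of counts that differ, so two sequences with no k-mers in common sit at the maximum distance of 1 and identical count vectors sit at 0.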