- Researcher
- Volume:04 Issue:02
An Investigation On The Execution Of The Document Clustering Process On Internet News
Authors : Metin Oktay Boz, Jale Bektaş
Pages : 113-119
Publication Date : 2024-12-31
Article Type : Other Papers
Abstract : Numerous studies have focused on recognizing Internet news as valid documents. This study applies text mining techniques to build a TF-IDF matrix and then automatically identifies and categorizes an optimal number of clusters. It examines the effect of K-Means document clustering on Internet news articles, using the User Engagement dataset, which includes articles from various well-regarded publishers. Before the K-Means algorithm was applied, several preprocessing steps were carried out to prepare the TF-IDF matrix. Because the content attribute was missing, the description attribute was used for document clustering. During preprocessing, extraneous ASCII symbols, punctuation marks, line breaks, emails, mentions, internet extensions, stopwords, and words outside the 2 to 21 character range were removed. Words were stemmed to consolidate different forms of the same root. The Elbow method was applied to the TF-IDF matrix to determine the optimal number of clusters, and the results were then analyzed using prominent words and word clouds. Ultimately, five clusters containing 797, 408, 89, 364, and 8755 documents were identified.
Keywords : K-Means, TF-IDF, Clustering, Document Clustering
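
The pipeline summarized in the abstract (cleaning, stemming, TF-IDF, K-Means, Elbow method) can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the tools used, so every function name, the stopword list, the regular expressions, the crude suffix-stripping stemmer (a stand-in for a real stemmer such as Porter), and the smoothed, L2-normalized TF-IDF variant are hypothetical choices made for the example.

```python
import math
import random
import re
from collections import Counter

# Illustrative stopword list (assumption: the study's actual list is not given).
STOPWORDS = {"a", "an", "and", "are", "as", "at", "for", "in", "is",
             "of", "on", "or", "the", "to", "us", "while"}

def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Clean one description string and return its stemmed tokens."""
    text = text.lower()
    text = re.sub(r"\S+@\S+", " ", text)   # emails
    text = re.sub(r"@\w+", " ", text)      # mentions
    text = re.sub(r"https?://\S+|www\.\S+|\S+\.(?:com|org|net)\S*", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)  # punctuation, digits, other symbols
    # split() also discards line breaks; keep words of 2 to 21 characters.
    kept = [t for t in text.split()
            if t not in STOPWORDS and 2 <= len(t) <= 21]
    return [crude_stem(t) for t in kept]

def tfidf_matrix(docs_tokens):
    """Return (vocabulary, rows) using a smoothed IDF and L2-normalized rows."""
    vocab = sorted({t for doc in docs_tokens for t in doc})
    n = len(docs_tokens)
    df = Counter(t for doc in docs_tokens for t in set(doc))
    rows = []
    for doc in docs_tokens:
        tf = Counter(doc)
        row = [tf[t] * (math.log((1 + n) / (1 + df[t])) + 1) for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_inertia(rows, k, iters=20, seed=0):
    """Run plain Lloyd's K-Means and return the within-cluster sum of squares."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for r in rows:
            nearest = min(range(k), key=lambda c: sqdist(r, centers[c]))
            clusters[nearest].append(r)
        for j, members in enumerate(clusters):
            if members:  # leave the center unchanged if its cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return sum(min(sqdist(r, c) for c in centers) for r in rows)

def elbow_curve(rows, max_k):
    """Inertia for k = 1..max_k; the 'elbow' is where the drop levels off."""
    return [kmeans_inertia(rows, k) for k in range(1, max_k + 1)]
```

To use the sketch, pass each article description through `preprocess`, feed the token lists to `tfidf_matrix`, and inspect `elbow_curve(rows, max_k)` for the value of k after which the inertia stops falling sharply. For a corpus of the size reported in the abstract, a sparse-matrix library would be the practical choice; the dense lists here only keep the example self-contained.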