About the Project

‘Indonesian SMS Spam Detector’ is a natural language processing application built in Python, with a GUI implemented in Tkinter. The model classifies an SMS as either spam or ham. It preprocesses (‘cleans’) the data, extracts features, vectorizes the words with a TF-IDF vectorizer, and classifies each message with Logistic Regression; the trained model is then exposed through the Tkinter GUI. My team and I created this project as our Natural Language Processing final project. The source code is available on my GitHub.

GitHub - nadyatyandra/IndonesianSMSSpamDetector
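The core of the pipeline described above (TF-IDF vectorization followed by Logistic Regression) can be sketched in a few lines of scikit-learn. This is only a minimal illustration, not the project's actual code: the toy messages and their labels are made up for the example, and the default vectorizer and classifier settings are an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy messages for illustration only; NOT taken from the project's dataset.
texts = [
    "selamat anda menang hadiah 100 juta, hubungi nomor ini",
    "promo pulsa murah, beli sekarang juga",
    "nanti sore jadi rapat di kantor?",
    "aku sudah sampai rumah, terima kasih ya",
]
labels = ["spam", "spam", "ham", "ham"]

# Chain the vectorizer and the classifier so raw text goes in and labels come out.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Classify two unseen (also made-up) messages.
predictions = model.predict(["menang hadiah pulsa gratis", "rapat besok jam berapa"])
print(list(predictions))
```

Wrapping both steps in one pipeline means the same vectorizer fitted on the training data is reused at prediction time, which is convenient when the model is later called from a GUI.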

Contributors

  1. Nadya Tyandra - Machine Learning Engineer
  2. Randy Antonio - Machine Learning Engineer
  3. Farrel Rasyad - Machine Learning Engineer

Dataset

The ‘dataset_sms_spam_v1.csv’ dataset consists of Indonesian SMS texts and their class labels: 0, 1, or 2.

SMS texts that contain fraud or promotion can be categorized as spam. Further explanation of the dataset can be found through this link, and the dataset itself can be accessed from here.
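Since the dataset has three labels but the classifier only distinguishes spam from ham, the labels have to be collapsed into two classes at some point. A minimal sketch of that step follows. It assumes that label 0 is a normal message and that labels 1 and 2 are the fraud and promotion classes; that mapping is an assumption made for this example and should be checked against the dataset's own documentation.

```python
# ASSUMPTION: 0 = normal message, 1 and 2 = fraud / promotion.
# Verify this against the dataset documentation before relying on it.
SPAM_LABELS = {1, 2}


def to_binary(label):
    """Collapse a three-way label into 'spam' or 'ham'."""
    return "spam" if label in SPAM_LABELS else "ham"


binary_labels = [to_binary(label) for label in [0, 1, 2]]
print(binary_labels)
```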

The ‘new_kamusalay.csv’ and ‘colloquial-indonesian-lexicon.csv’ datasets are also used to normalize the data, since the SMS texts may contain nonstandard spellings and typos. The ‘new_kamusalay.csv’ file can be accessed from here and the ‘colloquial-indonesian-lexicon.csv’ file can be accessed from here.
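Lexicon-based normalization of this kind usually amounts to a dictionary lookup per token. The sketch below shows the idea on a tiny inline sample rather than the real files. The column names `slang` and `formal`, the three sample entries, and the helper names `load_lexicon` and `normalize` are all hypothetical, chosen for illustration; the real lexicon files may be laid out differently.

```python
import csv
import io
import re

# Hypothetical three-entry sample in the style of a slang lexicon.
# The real CSV files may use different column names or no header at all.
SAMPLE_LEXICON_CSV = """slang,formal
gk,tidak
yg,yang
utk,untuk
"""


def load_lexicon(csv_file):
    """Read a slang -> formal mapping from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return {row["slang"]: row["formal"] for row in reader}


def normalize(text, lexicon):
    """Lowercase the text and replace each token found in the lexicon."""
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(lexicon.get(token, token) for token in tokens)


lexicon = load_lexicon(io.StringIO(SAMPLE_LEXICON_CSV))
normalized = normalize("Hadiah utk Anda yg beruntung", lexicon)
print(normalized)  # hadiah untuk anda yang beruntung
```

Tokens that do not appear in the lexicon pass through unchanged, so the lookup only touches known colloquial forms.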

Asset

The ‘spam.ico’ asset is used as the icon of the GUI window. The asset can be accessed from here and was later converted to .ico using this.

1. Importing Libraries

First, we need to import the required libraries and modules: Pandas, the regular expression module (re), Sastrawi, NLTK (Natural Language Toolkit), Scikit-learn, and Tkinter.
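Before running the imports, it can help to confirm that every library listed above is actually installed. The following is a small optional sketch using only the standard library; the `check_modules` helper is a hypothetical name introduced here, and the pip package names in the comments are the usual ones on PyPI.

```python
from importlib.util import find_spec

# Top-level import names; the matching pip packages are noted in the comments.
REQUIRED_MODULES = [
    "re",        # standard library (regular expressions)
    "tkinter",   # bundled with most Python installations
    "pandas",    # pip install pandas
    "Sastrawi",  # pip install Sastrawi
    "nltk",      # pip install nltk
    "sklearn",   # pip install scikit-learn
]


def check_modules(names):
    """Return a dict mapping each module name to True if it can be located."""
    return {name: find_spec(name) is not None for name in names}


availability = check_modules(REQUIRED_MODULES)
for name, found in availability.items():
    print(f"{name}: {'ok' if found else 'missing'}")
```

Any module reported as missing needs to be installed before the rest of the walkthrough will run.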