‘Email Spam Filter’ is a model built in Python, with a GUI implemented in Python Tkinter. It classifies an email as either spam or ham. The model preprocesses the data to ‘clean’ it, extracts the features, vectorizes the words using a TF-IDF vectorizer, performs the classification using Logistic Regression, and exposes the result through the Tkinter GUI. The source code can be accessed on my GitHub.
GitHub - nadyatyandra/Email-Spam-Filter
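To give a feel for the vectorize-then-classify part of that pipeline, here is a minimal sketch using scikit-learn. It is not the code from the repository: the four toy emails and their labels are made up for illustration, and chaining the two steps with `make_pipeline` is an assumption about how they could be combined.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy data, only to show the shape of the workflow.
texts = [
    "win a free prize now",
    "claim your free money today",
    "meeting moved to 10am tomorrow",
    "please review the attached report",
]
labels = ["spam", "spam", "ham", "ham"]

# TF-IDF turns each email into a weighted word-count vector;
# Logistic Regression then learns to separate the two classes.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Classify a new, unseen message.
prediction = model.predict(["free prize waiting for you"])[0]
print(prediction)
```

With only four training emails the prediction itself means little; the point is the order of the steps, which is the same on the full dataset.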
The ‘emails_V2.csv’ dataset contains the subject and body of English-language emails together with their classification, either spam or ham. The dataset can be accessed from here. The ‘spelling_variants_valid.csv’ dataset is also used to normalize the data, since the emails may contain spelling variants and typos. The dataset can be accessed from here.
The ‘message-mail-envelope-email-spam-inbox_108649.ico’ asset is used as the icon in the GUI window. The asset can be accessed from here.
Data preprocessing ‘cleans’ the data so that it is ready for feature extraction.
The dataset contains 3683 missing values, which would affect the model’s accuracy. Thus, we remove every row that contains a null value.
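# Count the missing values in each column
original_dataset.isnull().sum()
# Drop every row that contains at least one null value
original_dataset.dropna(inplace=True)
# Confirm that no missing values remain
original_dataset.isnull().sum()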
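Note that `dropna()` with its default arguments removes rows, not columns. The small sketch below shows the difference on a toy DataFrame; the values and the `spam` column name are made up for illustration and are not taken from ‘emails_V2.csv’.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "text": ["Win a prize now", None, "Meeting at 10am"],
    "spam": [1, 0, 0],
})

# Missing values per column: one in "text", none in "spam".
print(toy.isnull().sum())

# Default axis=0: drop every row that contains a null value.
rows_dropped = toy.dropna()

# axis=1: drop every column that contains a null value instead.
cols_dropped = toy.dropna(axis=1)

print(rows_dropped.shape)  # (2, 2)
print(cols_dropped.shape)  # (3, 1)
```

Dropping by column here would throw away the whole `text` column, which is why the row-wise default is the sensible choice for this kind of data.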
All text is converted to lowercase, so that words such as ‘Free’ and ‘free’ are treated as the same token.
dataset = original_dataset.copy()
dataset.text = dataset.text.str.lower()
dataset.head(5)
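The lowercasing step can be tried on a toy DataFrame, as sketched below. The two sample strings are made up, and bracket notation (`df["text"]`) is used here instead of attribute access; both work for an existing column.

```python
import pandas as pd

# Made-up sample rows standing in for the real emails.
original = pd.DataFrame({"text": ["FREE Offer!!!", "See You Tomorrow"]})

# Work on a copy so the original frame stays untouched.
lowered = original.copy()
lowered["text"] = lowered["text"].str.lower()

print(lowered["text"].tolist())   # ['free offer!!!', 'see you tomorrow']
print(original["text"].tolist())  # ['FREE Offer!!!', 'See You Tomorrow']
```

Because of the `copy()`, the original dataset keeps its casing and can still be inspected or reused later.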