About Project

‘Email Spam Filter’ is a model built in Python and wrapped in a GUI using Python Tkinter. It classifies an email as either spam or ham (not spam). The project preprocesses the data to ‘clean’ it, extracts features, vectorizes the words with a TF-IDF Vectorizer, classifies them with Logistic Regression, and exposes the result through a Tkinter GUI. The source code can be accessed on my GitHub.

GitHub - nadyatyandra/Email-Spam-Filter
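
To make the overall flow concrete, here is a minimal sketch of the TF-IDF plus Logistic Regression idea. It is not the code from the repository: the use of scikit-learn's `make_pipeline`, the six toy emails, and their labels are all assumptions made up for illustration, and the preprocessing and GUI steps are left out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy emails invented for this sketch (NOT taken from emails_V2.csv).
texts = [
    "win a free prize now",
    "free money claim your prize",
    "cheap loans win cash now",
    "meeting agenda for monday",
    "please review the attached report",
    "lunch with the project team tomorrow",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Step 1: turn each email into a TF-IDF weighted word vector.
# Step 2: fit a Logistic Regression classifier on those vectors.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Classify an unseen email; the result is one of the training labels.
prediction = model.predict(["claim your free prize now"])[0]
print(prediction)
```

Bundling the vectorizer and the classifier into one pipeline means that new emails are vectorized with the same vocabulary that was learned during training.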

Dataset

The ‘emails_V2.csv’ dataset consists of English email subjects and bodies together with their classification, either spam or ham. The dataset can be accessed from here. The ‘spelling_variants_valid.csv’ dataset is also used to normalize the text, since the emails may contain spelling variants and typos. The dataset can be accessed from here.

Asset

The ‘message-mail-envelope-email-spam-inbox_108649.ico’ asset is used as the icon of the GUI window. The asset can be accessed from here.

1. Data Preprocessing

Data preprocessing ‘cleans’ the data so that it is ready for feature extraction.

1.1 Data Preprocessing - Handling Missing Data

The 3683 missing values would affect the model’s accuracy. Thus, we remove all rows with null values.

original_dataset.isnull().sum()

original_dataset.dropna(inplace = True)
original_dataset.isnull().sum()
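
As a small illustration of what these calls do, here is a sketch on a made-up four-row table. The `text` column name follows the code in this post, while the `label` column and all values are assumptions for the example. Note that `dropna()` with its default `axis=0` drops rows, not columns.

```python
import pandas as pd

# Hypothetical miniature dataset; "label" and all values are invented.
toy = pd.DataFrame({
    "text": ["Win a FREE prize", None, "Meeting at 10am", "Cheap loans"],
    "label": ["spam", "ham", "ham", None],
})

# Count missing values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Default axis=0: every row containing at least one null is removed.
cleaned = toy.dropna()
print(cleaned)

# For contrast, axis=1 would remove every column containing a null.
columns_dropped = toy.dropna(axis=1)
```

Dropping rows keeps both columns and only loses the incomplete emails, whereas dropping columns here would leave nothing to train on.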

1.2 Data Preprocessing - Case Folding

All text is converted to lowercase.

dataset = original_dataset.copy()
dataset.text = dataset.text.str.lower()
dataset.head(5)
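
The effect of case folding can be checked on a tiny, made-up example. The sample strings below are assumptions for illustration, and only the `text` column name comes from the code above.

```python
import pandas as pd

# Hypothetical sample rows, invented for this sketch.
toy = pd.DataFrame({"text": ["FREE Prize!!!", "Meeting at 10AM", "Hello World"]})

# Work on a copy so the original frame stays untouched.
folded = toy.copy()
folded["text"] = folded["text"].str.lower()

print(folded["text"].tolist())
```

After folding, ‘FREE’, ‘Free’, and ‘free’ all map to the same token, so the vectorizer does not treat them as three different words. Punctuation and digits are left as they are.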