Data Pre-Processing in Spam Detection

Anjali Sharma
2015 IJSTE-International Journal of Science Technology & Engineering |   unpublished
Nowadays, most people have access to the Internet, and the digital world has become one of the most important parts of everybody's life. People use the Internet not only for fun and entertainment, but also for business, banking, stock marketing, searching and so on. Hence, the usage of the Internet is growing rapidly. One of the threats to such technology is spam. Spam is junk or unsolicited mail/messages: essentially, online communication sent to the user without permission. Spam has increased tremendously in the last few years. Today more than 85% of the mail/messages received by users are spam. Spam is a very serious problem because spamming has become a very profitable business for spammers. Spam email takes various forms, such as adult content, offers of products or services, job offers, etc. Spam costs the sender very little to send; most of the costs are paid by the recipient or the service providers. The cost of spam can also be measured in lost human time, lost server time and loss of valuable mail/messages. In spam filtering, cleaning the textual data is critical. The main objective of data pre-processing in spam detection is to remove data that give no useful information about the class of the document. This paper focuses on the pre-processing steps for text data: noise elimination, stop word removal, and stemming. For stemming, Porter's algorithm has been used. Finally, some results obtained after applying all the data pre-processing steps are presented.
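
The three pre-processing steps named above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the tiny stop word list, and the choice of what counts as noise (HTML tags, digits, punctuation) are assumptions made for illustration. The stemmer covers only step 1a of Porter's algorithm (plural suffixes), not the full algorithm.

```python
import re

# Hypothetical, deliberately tiny stop word list; a real filter would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "you", "your"}


def eliminate_noise(text):
    """Noise elimination (assumed definition): drop HTML tags, digits, punctuation."""
    text = re.sub(r"<[^>]+>", " ", text)            # strip HTML tags
    text = re.sub(r"[^a-z\s]", " ", text.lower())   # keep lowercase letters and whitespace
    return text.split()


def remove_stop_words(tokens):
    """Drop very common words that carry little information about the class."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem_step_1a(word):
    """Step 1a of Porter's algorithm only: SSES -> SS, IES -> I, SS -> SS, S -> ''."""
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word


def preprocess(text):
    """Run noise elimination, stop word removal, then the partial stemmer."""
    return [stem_step_1a(t) for t in remove_stop_words(eliminate_noise(text))]


if __name__ == "__main__":
    print(preprocess("<b>FREE</b> offers for you: 100 ponies and caresses!"))
```

The steps run in this order so that stop words are matched on cleaned, lowercased tokens before stemming alters them. A full Porter stemmer applies several further suffix-stripping steps that this sketch omits.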