Tokenisation
import pandas as pd

# a small DataFrame of example SMS messages
content = ['hey boss!', 'great..', 'nice@']
df = pd.DataFrame(content, columns=['Sms content'])
df

|   | Sms content |
|---|---|
| 0 | hey boss! |
| 1 | great.. |
| 2 | nice@ |
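
Before applying a tokeniser to the whole column, it helps to see what splitting on `\W+` (one or more non-word characters) does to a single message. The snippet below is a quick standalone check using the same pattern as the tokeniser further down; the sample strings simply mirror the ones in `content`.

```python
import re

# \W+ matches runs of characters outside [a-zA-Z0-9_],
# so splitting on it breaks a message into word-like chunks.
print(re.split(r'\W+', 'hey boss!'))  # ['hey', 'boss', '']
print(re.split(r'\W+', 'great..'))    # ['great', '']
print(re.split(r'\W+', 'nice@'))      # ['nice', '']
```

Note the trailing empty strings: when a message ends in punctuation, `re.split` leaves an empty token at the end, which is exactly what shows up in the tokenised column below.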
import re

def tokenize(text):
    # split on one or more non-word characters
    tokens = re.split(r'\W+', text)
    return tokens

df['tokenized_text'] = df['Sms content'].apply(lambda row: tokenize(row.lower()))
df.head()

|   | Sms content | tokenized_text |
|---|---|---|
| 0 | hey boss! | [hey, boss, ] |
| 1 | great.. | [great, ] |
| 2 | nice@ | [nice, ] |
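
The empty strings at the end of each token list come from `re.split` hitting the trailing punctuation. If you want to drop them, one small variation (a sketch, not the approach used above; the name `tokenize_clean` is purely illustrative) is to use `re.findall` with `\w+`, which returns only the word-character runs and never produces empty tokens:

```python
import re
import pandas as pd

# Variation on the tokenizer above: findall keeps only runs of word
# characters, so trailing punctuation no longer yields empty tokens.
# The name tokenize_clean is illustrative, not from the original code.
def tokenize_clean(text):
    return re.findall(r'\w+', text.lower())

content = ['hey boss!', 'great..', 'nice@']
df = pd.DataFrame(content, columns=['Sms content'])
df['tokenized_text'] = df['Sms content'].apply(tokenize_clean)

print(df['tokenized_text'].tolist())
# [['hey', 'boss'], ['great'], ['nice']]
```

Whether the empty tokens matter depends on the downstream steps: for things like stop-word removal or vectorisation they only add noise, so filtering them out early keeps the pipeline cleaner.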