Text Preprocessing in Python Using NLTK


Data preprocessing is an important step in building a Machine Learning model, and how well the model performs depends on how well the data has been preprocessed. In NLP, text preprocessing is the first step in the model-building process.

Text mining is a multidisciplinary field that covers information retrieval, text analysis, information extraction, categorization, clustering, visualization, data mining, and machine learning. The basic concept of text mining is discovering new, previously unknown or hidden information from textual data using various extraction techniques.

Text Preprocessing Steps:

  • Lowercase
  • Remove numbers
  • Remove whitespace from text
  • Remove punctuation
  • Remove stopwords
  • Lemmatize string

For the data source, I use abstracts from Jurnal Komputasi, taken from the Jurnal Komputasi of FMIPA Unila. I built the app using the Django framework and a MySQL database.

First, create a .py file named Preprocess.py that contains the functions above, and import a few libraries:

import nltk 
import string 
import re
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer 

Lowercase

Python Function Code:

#lowercase
def text_lowercase(text): 
    return text.lower()

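As a quick illustration (the sample string here is my own, not taken from the jurnal data):

```python
# lowercase: str.lower() converts the whole string at once
def text_lowercase(text):
    return text.lower()

print(text_lowercase("Text Preprocessing With NLTK"))  # text preprocessing with nltk
```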

Remove numbers

Python Function Code:

def remove_numbers(text): 
    result = re.sub(r'\d+', '', text) 
    return result 

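A quick check with a made-up sample string. Note that stripping the digits can leave double spaces behind; those are cleaned up later by remove_whitespace:

```python
import re

# remove numbers: \d+ matches one or more consecutive digits
def remove_numbers(text):
    return re.sub(r'\d+', '', text)

print(remove_numbers("abstract no 42 from 2019"))  # 'abstract no  from '
```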

Remove punctuation

Python Function Code:

# remove punctuation 
def remove_punctuation(text): 
    translator = str.maketrans('', '', string.punctuation) 
    return text.translate(translator)

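For example (my own sample sentence) — str.maketrans with a third argument maps every character in string.punctuation to None, so translate() drops them all:

```python
import string

# remove punctuation using a translation table
def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

print(remove_punctuation("Hello, world! (NLP)"))  # Hello world NLP
```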

Remove whitespace from text

Python Function Code:

# remove whitespace from text 
def remove_whitespace(text): 
    return " ".join(text.split())

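For example (sample string is my own) — split() with no argument splits on any run of whitespace, so rejoining with a single space collapses tabs, newlines, and repeated spaces:

```python
# remove extra whitespace by splitting and rejoining
def remove_whitespace(text):
    return " ".join(text.split())

print(remove_whitespace("  too   many    spaces "))  # too many spaces
```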

Remove stopwords

Python Function Code:

# remove stopwords function 
def remove_stopwords(text): 
    stop_words = set(stopwords.words("english")) 
    word_tokens = word_tokenize(text) 
    filtered_text = [word for word in word_tokens if word not in stop_words] 
    return filtered_text


Lemmatize string

Python Function Code:

# lemmatize string 
lemmatizer = WordNetLemmatizer()
def lemmatize_word(text): 
    stop_words = set(stopwords.words("english")) 
    stop_words.update(('one','two','and','I','A','And','So','arnt','This','When','It','many','Many','so','cant','Yes','yes','No','no','These','these'))
    word_tokens = word_tokenize(text) 
    # provide context i.e. part-of-speech 
    lemmas = [lemmatizer.lemmatize(word, pos='v') for word in word_tokens if word not in stop_words] 
    new_sentence = ' '.join(lemmas)
    return new_sentence


Conclusion

Those are the individual preprocessing functions; combined, they look like this:

import nltk 
import string 
import re
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer 

#lowercase
def text_lowercase(text): 
    return text.lower()

# Remove numbers 
def remove_numbers(text): 
    result = re.sub(r'\d+', '', text) 
    return result 

# remove whitespace from text 
def remove_whitespace(text): 
    return " ".join(text.split())

# remove punctuation 
def remove_punctuation(text): 
    translator = str.maketrans('', '', string.punctuation) 
    return text.translate(translator)

# remove stopwords function 
def remove_stopwords(text): 
    stop_words = set(stopwords.words("english")) 
    word_tokens = word_tokenize(text) 
    filtered_text = [word for word in word_tokens if word not in stop_words] 
    return filtered_text 

# lemmatize string 
lemmatizer = WordNetLemmatizer()
def lemmatize_word(text): 
    stop_words = set(stopwords.words("english")) 
    stop_words.update(('one','two','and','I','A','And','So','arnt','This','When','It','many','Many','so','cant','Yes','yes','No','no','These','these'))
    word_tokens = word_tokenize(text) 
    # provide context i.e. part-of-speech 
    lemmas = [lemmatizer.lemmatize(word, pos='v') for word in word_tokens if word not in stop_words] 
    new_sentence = ' '.join(lemmas)
    return new_sentence
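The module above can be driven by a small pipeline function. Here is a minimal sketch (the preprocess helper and the sample sentence are my own additions, not part of the original module) that chains the string-level steps in order; remove_stopwords and lemmatize_word can be appended to the chain once the NLTK corpora are downloaded:

```python
import re
import string

def text_lowercase(text):
    return text.lower()

def remove_numbers(text):
    return re.sub(r'\d+', '', text)

def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

def remove_whitespace(text):
    return " ".join(text.split())

def preprocess(text):
    # apply the string-level steps in order; the NLTK-based steps
    # (stopword removal, lemmatization) would follow the same pattern
    for step in (text_lowercase, remove_numbers, remove_punctuation, remove_whitespace):
        text = step(text)
    return text

print(preprocess("  An  Abstract, from 2019: Text Mining!  "))  # an abstract from text mining
```

Keeping each step as its own small function makes it easy to reorder the steps or drop one for a particular corpus.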

You can download the source code from my GitHub; if you have any questions, feel free to reach out by email.
