


import nltk

# Code to download all the NLTK packages.
# A window will pop up; select 'all' and download all the packages.
nltk.download()
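# Alternatively (an assumption on my part, not from the original article), instead of
# the full download above you can fetch only the packages this walkthrough uses:
nltk.download('punkt')      # tokenizer models used by nltk.sent_tokenize()
nltk.download('stopwords')  # the English stop-word list
nltk.download('wordnet')    # lexical data used by WordNetLemmatizer
# (depending on your NLTK version you may also be prompted for 'omw-1.4' or 'punkt_tab')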



# Importing the required library
import nltk

paragraph = 'Scorpions are predatory arachnids of the order Scorpiones. They have eight legs, a pair of grasping pincers and a narrow, segmented tail, often carried in a characteristic forward curve over the back and always ending with a stinger. There are over 2,500 described species. They mainly live in deserts but have adapted to a wide range of environments. Most species give birth to live young, and the female cares for the juveniles while their exoskeletons harden, transporting them on her back. Scorpions primarily prey on insects and other invertebrates, but some species take vertebrates. They use their pincers to restrain and kill prey. Scorpions themselves are preyed on by larger animals. Their venomous sting can be used both for killing prey and for defense. Only about 25 species have venom capable of killing a human. In regions with highly venomous species, human fatalities regularly occur.'

# Data Preprocessing (Cleaning the data)

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# from nltk.stem import PorterStemmer
# I am using lemmatization directly instead of stemming. You can use stemming as well, but lemmatization gives better (more readable) results.

lem = WordNetLemmatizer()
sentences = nltk.sent_tokenize(paragraph)
after_lem = []

for i in range(len(sentences)):
    # Keep only letters; every other character becomes a space
    reg = re.sub('[^a-zA-Z]', ' ', sentences[i])
    reg = reg.lower()
    reg = reg.split()
    # Remove stop words and lemmatize whatever is left
    reg = [lem.lemmatize(word) for word in reg if word not in set(stopwords.words('english'))]
    reg = ' '.join(reg)
    after_lem.append(reg)

from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer()
x = tf.fit_transform(after_lem).toarray()
print(x)
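# To see which column of x corresponds to which word (a small addition of mine, not
# in the original article), you can ask the vectorizer for its vocabulary. Depending
# on your scikit-learn version the method is get_feature_names_out() (1.0 and later)
# or get_feature_names() (older releases):
print(tf.get_feature_names_out())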

# Summary

# Basically, what happens when we use the stop words is this: take, for example, a sentence like "Hello my name is Avinash Kumar".

# When we apply the stop words to this sentence, all the words like 'my' and 'is', which don't have any meaning on their own, are removed.
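
# A small sketch of that idea (not part of the original code):
from nltk.corpus import stopwords
words = "Hello my name is Avinash Kumar".lower().split()
filtered = [w for w in words if w not in set(stopwords.words('english'))]
print(filtered)  # ['hello', 'name', 'avinash', 'kumar']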

# What is Lemmatization?

# Basically, lemmatization means that if, for example, we have words like

# History
# Hist
# Histories

# then upon lemmatization these words are converted to and replaced by 'history', which is a word with actual meaning.
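
# A quick comparison of the two, using an illustrative word of my own (the stemmer
# output shows why lemmatization is preferred here):
from nltk.stem import WordNetLemmatizer, PorterStemmer
lem = WordNetLemmatizer()
stem = PorterStemmer()
print(lem.lemmatize("histories"))  # 'history' - a real word
print(stem.stem("histories"))      # 'histori' - a truncated root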

# After that we used a for loop

# 1) First we removed all the unnecessary symbols like '!', '.', ',' and so on, because these are of no use to us.

# 2) We did that using a regular expression: re.sub() replaces every character other than a-z and A-Z with a space (' '), as illustrated in the small sketch after this list.

# 3) After that we converted every sentence to lowercase and split it into words. We lowercase them because 'Good' and 'good' have the same meaning; if we don't convert them, they would be counted as two separate words.

# 4) Then we removed all the stop words, lemmatized the remaining words, and appended the cleaned sentences to a new list.

# 5) To create the TF-IDF model we used something called TfidfVectorizer.

# 6) We create an object and call fit_transform() on the cleaned sentences; TfidfVectorizer basically builds a matrix of word scores.
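
# For example, here is what step 2 does to a single sentence (just an illustration,
# not part of the original code):
import re
sample = "There are over 2,500 described species."
print(re.sub('[^a-zA-Z]', ' ', sample))  # digits and punctuation are replaced by spaces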

# How to understand the output

# In the output we see an array of numbers, so what are these, and how do we interpret them?

# We have a total of 11 sentences after sent_tokenize().

# Each of these 11 sentences is converted into a vector of scores based on word frequency, after stop-word removal.

# After that we calculate two quantities, TF and IDF; combining the two gives us the required output.

# TF (term frequency) is calculated as TF = (number of times the word appears in a sentence / number of words in the sentence).

# IDF (inverse document frequency) is calculated as IDF = log(number of sentences / number of sentences containing the word).

# The product TF * IDF then gives us the output array.
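
# A quick worked example with made-up numbers (not taken from the actual output):
# suppose a word appears 2 times in a 10-word sentence and occurs in 2 of the 11
# sentences. Using the formulas above with the natural log:
import math
tf_val = 2 / 10             # = 0.2
idf_val = math.log(11 / 2)  # ≈ 1.70
print(tf_val * idf_val)     # TF * IDF ≈ 0.34
# Note: scikit-learn's TfidfVectorizer uses slightly different defaults (smoothed
# IDF = log((1 + n) / (1 + df)) + 1, raw counts for TF, and L2-normalised rows),
# so the values in x will not match this hand calculation exactly.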

# Hope you understand.


Exploratory Data Analysis

Hello guys, today we will focus on the basic steps of EDA that we need to perform on a data set where the data is continuous.

First, let's import all the required libraries for the analysis.

The section below contains all the libraries and code we need for Exploratory Data Analysis.

# Numpy Library

import numpy as np

# Pandas Library

import pandas as pd

# Matplotlib Library

import matplotlib.pyplot as plt

# Seaborn Library

import seaborn as sns

Step-1

The first step is to always understand the variables present in the data set.

  • This includes knowing the shape and info by…
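
A rough sketch of what that first look usually involves is below. The file name 'data.csv' is just a placeholder, and the exact calls are my assumption rather than the original article's code:

import pandas as pd

# 'data.csv' is a hypothetical file name; substitute your own data set
df = pd.read_csv('data.csv')

print(df.shape)       # number of rows and columns
df.info()             # column names, dtypes and non-null counts
print(df.describe())  # summary statistics for the continuous columns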

Avinash

Currently pursuing my postgraduate degree at Manipal University.
