Tokenization with an example

Avinash
2 min read · Jun 1, 2021

import nltk

# Download the NLTK data packages.
# A dialog box will pop up; select "all" and download all the packages.
nltk.download()

paragraph = 'Scorpions are predatory arachnids of the order Scorpiones. They have eight legs, a pair of grasping pincers and a narrow, segmented tail, often carried in a characteristic forward curve over the back and always ending with a stinger. There are over 2,500 described species. They mainly live in deserts but have adapted to a wide range of environments. Most species give birth to live young, and the female cares for the juveniles while their exoskeletons harden, transporting them on her back. Scorpions primarily prey on insects and other invertebrates, but some species take vertebrates. They use their pincers to restrain and kill prey. Scorpions themselves are preyed on by larger animals. Their venomous sting can be used both for killing prey and for defence. Only about 25 species have venom capable of killing a human. In regions with highly venomous species, human fatalities regularly occur.'

# Tokenizing the paragraph into sentences.
sentences = nltk.sent_tokenize(paragraph)

# Tokenizing the paragraph into words.
words = nltk.word_tokenize(paragraph)

# Check how many sentences and words were produced.
print(len(sentences))
print(len(words))
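
To see what the tokenizers actually return, the short check below can help. It is a minimal sketch that assumes the code above has already been run, so the sentences and words lists exist; the exact tokens can vary slightly with your NLTK version.

# Inspect the results: the first sentence and the first ten word tokens.
print(sentences[0])   # 'Scorpions are predatory arachnids of the order Scorpiones.'
print(words[:10])     # note that the full stop '.' shows up as its own token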

  • So, upon running the above code, sent_tokenize() splits the paragraph into sentences: each sentence in the paragraph becomes one element of the resulting list.
  • By default, sentence boundaries are detected at end-of-sentence punctuation such as ‘.’, ‘!’ and ‘?’ (NLTK uses its pre-trained Punkt model for this), but the tokenizer can be changed, as the sketch after this list shows.
  • word_tokenize() splits the whole paragraph into individual words; it separates tokens on whitespace and also treats punctuation marks as tokens of their own.
  • Hope this helps.
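
As one hedged illustration of changing the tokenizer, the sketch below uses NLTK's RegexpTokenizer with the pattern r'\w+', which keeps only runs of word characters, so punctuation is dropped instead of being returned as separate tokens. The names regexp_tokenizer and words_no_punct are just illustrative choices for this example.

from nltk.tokenize import RegexpTokenizer

# Keep only runs of word characters (letters, digits, underscore); punctuation is discarded.
regexp_tokenizer = RegexpTokenizer(r'\w+')
words_no_punct = regexp_tokenizer.tokenize(paragraph)

print(words_no_punct[:10])   # no '.' or ',' tokens here, unlike word_tokenize()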


Avinash

Currently pursuing my post-graduation at Manipal University. LinkedIn profile: www.linkedin.com/in/avinash-kumar-60396710b