Natural Language Processing Terminology

We covered machine learning and deep learning terminology in our previous posts. Let's look at the most commonly used glossary terms in Natural Language Processing.


Natural Language Processing (NLP)
NLP is the area of machine learning focused on human language, both written and spoken.

Vocabulary
The entire set of terms used in a body of text.

Out of Vocabulary
In NLP, data used to train our model consists of a finite number of vocabulary terms. Very often, we will encounter out of vocabulary terms when using our model for inference. Typically, a common placeholder is assigned for these terms.
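
A minimal sketch of this idea in Python (the <UNK> placeholder and the tiny vocabulary here are illustrative choices, not a fixed standard):

```python
# A sketch of out-of-vocabulary handling: terms never seen during
# training fall back to a shared placeholder token.
vocab = {"<UNK>": 0, "the": 1, "day": 2, "is": 3, "young": 4}

def encode(tokens):
    # Map each token to its id, using <UNK> for unknown terms.
    return [vocab.get(token, vocab["<UNK>"]) for token in tokens]

print(encode(["the", "day", "is", "wonderful"]))  # [1, 2, 3, 0]
```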

Corpus (Plural: Corpora)
A corpus is a collection of texts. A corpus can be a collection of movie reviews, internet comments or conversations between two people.

Documents
Document refers to a body of text; a collection of documents makes up a corpus. For instance, a movie review or an email is a document.

Preprocessing
The first step to any NLP task is to preprocess the text. The goal of preprocessing is to “clean” the text by removing as much noise as possible. Common preprocessing steps are described in the techniques section.

Tokenization
The process of breaking a large chunk of text into smaller pieces. This is usually done so that each small piece, or token, can be mapped to a meaningful unit of information. If we choose to break our text on the word level, each word becomes its own token.
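
As a rough illustration, a word-level tokenizer can be sketched with a regular expression (real projects usually rely on library tokenizers such as nltk's or spaCy's, which handle many more edge cases):

```python
import re

# A toy word-level tokenizer: runs of word characters become tokens,
# and punctuation marks become tokens of their own.
def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("The day is young, isn't it?"))
# ['The', 'day', 'is', 'young', ',', 'isn', "'", 't', 'it', '?']
```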

(Word) Embeddings
Each token is embedded as a vector before it can be passed to a machine learning model. While generally referred to as word embeddings, embeddings can be created on the character or phrase level as well. Following the techniques section is an entire section on different types of embeddings.
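
Conceptually, an embedding is just a lookup into a matrix of vectors. A toy sketch, with random vectors standing in for learned ones:

```python
import numpy as np

# A toy embedding lookup: each token id selects one row of a
# (vocabulary size x dimension) matrix. Real embedding vectors are
# learned from data; random values here are purely illustrative.
vocab = {"the": 0, "day": 1, "is": 2, "young": 3}
embedding_matrix = np.random.rand(len(vocab), 5)  # 5-dimensional vectors

def embed(token):
    return embedding_matrix[vocab[token]]

print(embed("day"))  # the 5-dimensional vector assigned to "day"
```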

n-grams
A contiguous sequence of n tokens in a given text. In the phrase the day is young, we have bi-grams (the, day), (day, is), (is, young).
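
A small Python sketch that produces these bi-grams:

```python
# Build n-grams by zipping n shifted views of the token list.
def ngrams(tokens, n):
    return list(zip(*(tokens[i:] for i in range(n))))

print(ngrams(["the", "day", "is", "young"], 2))
# [('the', 'day'), ('day', 'is'), ('is', 'young')]
```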

Transformers
A deep learning architecture introduced in 2017 which surpassed several prior benchmarks for NLP tasks. Transformers address two shortcomings of recurrent neural networks: their computations are easier to parallelize, and they are better suited to learning long-term dependencies between words.

Techniques

Parts of Speech (POS)
The syntactic function of a word. We are all probably familiar with the different parts of speech in English: noun, verb, adjective, adverb … etc.

Parts of Speech Tagging
The process of assigning a parts of speech tag to each token in the text.
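
For illustration, nltk ships a pre-trained tagger (the models need a one-time download):

```python
import nltk

# The pre-trained tokenizer and tagger models need a one-time download.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The day is young")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('day', 'NN'), ('is', 'VBZ'), ('young', 'JJ')]
```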

Normalization
The process of reducing similar tokens to a canonical form. For instance, if we believe hello and Hello are for all intents and purposes the same, we can normalize our text by mapping both terms to hello.
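
A trivial sketch of this particular normalization:

```python
# Lower-casing as the simplest normalization: hello and Hello collapse
# to the same canonical form.
tokens = ["Hello", "hello", "HELLO"]
print([t.lower() for t in tokens])  # ['hello', 'hello', 'hello']
```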

Stop Words
These words are filtered out during preprocessing, prior to any modelling. Stop words are chosen based on their insignificance to the NLP task at hand. For instance, the nltk list of English stop words identifies common words such as a, to and can for exclusion.
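
A sketch using nltk's English stop word list (downloaded once beforehand):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-time download of the word lists

stop_words = set(stopwords.words("english"))
tokens = ["a", "movie", "to", "remember"]
print([t for t in tokens if t not in stop_words])  # ['movie', 'remember']
```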

Lemmatization
A normalization technique of grouping inflected terms to their base form conditioned on the parts of speech of the text. For example, walking and walked would both be mapped to walk.
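
A sketch using nltk's WordNet lemmatizer, where the part of speech is supplied explicitly:

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # one-time download of the WordNet data

lemmatizer = WordNetLemmatizer()
# The part of speech ("v" for verb) guides the reduction.
print(lemmatizer.lemmatize("walking", pos="v"))  # walk
print(lemmatizer.lemmatize("walked", pos="v"))   # walk
```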

Stemming
Similar to lemmatizing, stemming also reduces inflected terms to their base forms. The only difference is that the parts of speech tag is not used to determine the base form.

"Note on lemmatizing vs stemming: you might think, why would I ever need to use stemming? Wouldn’t this reduction step be more accurate if we knew the parts of speech? One clear advantage of stemming is that it is much faster. Another is that it eliminates the margin of error created by automatic parts of speech taggers."

Common NLP Tasks

Sentiment Analysis
The automated process of detecting sentiment from text. A common application of sentiment analysis is to determine whether a review is positive or negative, or whether the text supports a certain sentiment.
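
As one illustration, nltk bundles the rule-based VADER analyzer (the lexicon needs a one-time download):

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("This movie was absolutely wonderful!")
print(scores)  # a compound score above 0 indicates positive sentiment
```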

Machine Translation
This one is self-explanatory: all of your favourite automated translation tools are NLP applications.

Machine (Reading) Comprehension/Question Answering
Machine Comprehension, usually carried through Question Answering, is the task of automatically “understanding” the text. It’s usually tested through reading comprehension questions where the input is a contextual document and a set of questions that can be answered using the document. The AI infers the answers based on these inputs.

Named Entity Recognition (NER)
The automatic extraction of relevant entities (such as names, addresses and phone numbers) from an unstructured document.
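
A sketch using spaCy's small English model (installed separately with python -m spacy download en_core_web_sm):

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Ada Lovelace was born in London in 1815.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Ada Lovelace PERSON, London GPE
```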

Information Retrieval/Latent Semantic Indexing
The automatic retrieval of information from a large system (think web search engines). The problem is defined as returning the right document(s) when provided with a specific query.

Embeddings

Bag of Words
This is the simplest method of embedding words into numerical vectors. It is not often used in practice due to its oversimplification of language, but is commonly found in examples and tutorials.

Consider these documents:

Document 1: high five
Document 2: I am old.
Document 3: She is five.

This gives us the vocabulary: high, five, I, am, old, she, is. For simplicity, we will ignore punctuation and normalize by converting to lower case. We can construct a matrix which represents the number of times each vocabulary term occurs in each document:

        Doc 1   Doc 2   Doc 3
high      1       0       0
five      1       0       1
i         0       1       0
am        0       1       0
old       0       1       0
she       0       0       1
is        0       0       1

This gives us the Bag of Words representation of each word and document. Move horizontally to get the word representation: high is [1, 0, 0]. Move vertically to get the document representation: document 1 is [1, 1, 0, 0, 0, 0, 0].
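
The same representation can be reproduced with scikit-learn's CountVectorizer; note that its default tokenizer drops one-character tokens such as I, so the learned vocabulary differs slightly from the hand-built one above:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["high five", "I am old.", "She is five."]
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(matrix.toarray())  # one row per document, one column per term
```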

TF-IDF (term frequency — inverse document frequency)
Unlike Bag of Words, TF-IDF weighs each term by its relative importance to each document. The vector representation of each term and document can be extracted in the same fashion as Bag of Words.


The TF-IDF statistic for term i in document j is typically calculated as

tfidf(i, j) = tf(i, j) × log(N / df(i))

where tf(i, j) is the number of times term i occurs in document j, N is the total number of documents in the corpus, and df(i) is the number of documents that contain term i. Terms that appear in many documents are weighted down, while rare terms are weighted up.

The document vectors can be used as features for a variety of machine learning models (SVM, Naive Bayes, Logistic Regression … etc).
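
A sketch with scikit-learn's TfidfVectorizer, which layers smoothing and normalization on top of the basic formula above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["high five", "I am old.", "She is five."]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)
print(matrix.toarray())  # one TF-IDF weighted row per document
```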

word2vec
Trained over large corpora, word2vec uses a shallow neural network to determine semantic and syntactic meaning from word co-occurrence. 

word2vec creates a high-dimensional feature space. More sophisticated embeddings such as these are well suited for recurrent neural networks, where word ordering is also taken into consideration.
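
A toy training run with gensim (a real model would need a far larger corpus to produce meaningful vectors):

```python
from gensim.models import Word2Vec

# min_count=1 keeps every term in this tiny corpus; real models use
# far larger corpora and higher thresholds.
sentences = [["the", "day", "is", "young"],
             ["the", "night", "is", "old"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)
print(model.wv["day"])                     # the 50-dimensional vector for "day"
print(model.wv.most_similar("day", topn=2))
```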

Context Dependent Embeddings
word2vec embeddings are independent of context: each word is mapped to the same vector regardless of its surrounding context. Models such as BERT and ELMo create word embeddings that vary based on the context of the phrase.
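
A sketch with the Hugging Face transformers library, showing that the vector for bank changes with its context (the pre-trained model is downloaded on first use):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# The pre-trained model and tokenizer are downloaded on first use.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

for text in ["I sat by the river bank.", "I deposited cash at the bank."]:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Locate the "bank" token and print part of its contextual vector.
    idx = inputs["input_ids"][0].tolist().index(
        tokenizer.convert_tokens_to_ids("bank"))
    print(outputs.last_hidden_state[0, idx, :5])  # differs per sentence
```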

***Thank You***
