We covered machine learning and deep learning terminology in our previous posts. Let's look at the most commonly used glossary terms in Natural Language Processing.
Natural Language Processing (NLP)
NLP is the area of machine learning focused on tasks involving human language. This includes both written and spoken language.
Vocabulary
The entire set of terms used in a body of text.
Out of Vocabulary
In NLP, data used to train our model consists of a finite number of vocabulary terms. Very often, we will encounter out of vocabulary terms when using our model for inference. Typically, a common placeholder is assigned for these terms.
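As a minimal sketch (the vocabulary and the <UNK> placeholder below are illustrative, not a fixed standard), out-of-vocabulary handling can be as simple as:

```python
# Minimal sketch: replace tokens outside a fixed vocabulary with a
# placeholder. The vocabulary and "<UNK>" token here are illustrative.
vocabulary = {"the", "day", "is", "young"}

def replace_oov(tokens, vocab, placeholder="<UNK>"):
    return [t if t in vocab else placeholder for t in tokens]

print(replace_oov(["the", "night", "is", "young"], vocabulary))
# ['the', '<UNK>', 'is', 'young']
```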
Corpus (Plural: Corpora)
A corpus is a collection of text. A corpus can be a collection of movie reviews, internet comments, or conversations between two people.
Documents
Document refers to a body of text. A collection of documents makes up a corpus. For instance, a movie review or an email is a document.
Preprocessing
The first step to any NLP task is to preprocess the text. The goal of preprocessing is to “clean” the text by removing as much noise as possible. Common preprocessing steps are described in the techniques section.
Tokenization
The process of breaking a large chunk of text into smaller pieces. This is usually done so that each small piece, or token, can be mapped to a meaningful unit of information. If we choose to break our text on the word level, each word becomes its own token.
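A minimal sketch of word-level tokenization using NLTK (assumes nltk is installed; newer NLTK versions may name the tokenizer data punkt_tab instead of punkt):

```python
import nltk
nltk.download("punkt", quiet=True)  # tokenizer data (newer NLTK: "punkt_tab")

from nltk.tokenize import word_tokenize

tokens = word_tokenize("The day is young.")
print(tokens)  # ['The', 'day', 'is', 'young', '.']
```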
(Word) Embeddings
Each token is embedded as a vector before it can be passed to a machine learning model. While generally referred to as word embeddings, embeddings can be created on the character or phrase level as well. Following the techniques section is an entire section on different types of embeddings.
n-grams
A contiguous sequence of n tokens in a given text. In the phrase the day is young, we have bi-grams (the, day), (day, is), (is, young).
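n-grams are easy to extract with a sliding window. A small sketch in plain Python:

```python
# Extract all contiguous n-grams from a token list with a sliding window.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["the", "day", "is", "young"], 2))
# [('the', 'day'), ('day', 'is'), ('is', 'young')]
```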
Transformers
A deep learning architecture introduced in 2017 which surpassed several prior benchmarks on NLP tasks. Transformers address two shortcomings of recurrent neural networks: they can parallelize computation, and they are better suited to learning long-term dependencies between words.
Techniques
Parts of Speech (POS)
The syntactic function of a word. We are all probably familiar with the different parts of speech in English: noun, verb, adjective, adverb, etc.
Parts of Speech Tagging
The process of assigning a part-of-speech tag to each token in the text.
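A quick sketch using NLTK's built-in tagger (assumes nltk is installed; newer NLTK versions may name the tagger data averaged_perceptron_tagger_eng):

```python
import nltk
nltk.download("averaged_perceptron_tagger", quiet=True)  # tagger data

tokens = ["The", "day", "is", "young"]
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('day', 'NN'), ('is', 'VBZ'), ('young', 'JJ')]
```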
Normalization
The process of reducing similar tokens to a canonical form. For instance, if we believe hello and Hello are for all intents and purposes the same, we can normalize our text by mapping both terms to hello.
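A minimal sketch of the hello/Hello example, using lower-casing as the canonical form:

```python
# Lower-casing as a simple normalization step: "Hello" and "hello"
# both map to the same canonical form.
tokens = ["Hello", "world", "hello"]
print([t.lower() for t in tokens])  # ['hello', 'world', 'hello']
```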
Stop Words
These words are removed during preprocessing and ignored in downstream modelling tasks. Stop words are chosen based on their insignificance to the NLP task at hand. For instance, the nltk stop word list for English flags common words such as a, to, and can for exclusion.
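A small sketch of stop word removal with the nltk list (assumes nltk is installed and the stopwords data is downloaded):

```python
import nltk
nltk.download("stopwords", quiet=True)

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["a", "review", "can", "be", "positive"]
print([t for t in tokens if t not in stop_words])  # ['review', 'positive']
```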
Lemmatization
A normalization technique that groups inflected terms into their base form, conditioned on the part of speech. For example, walking and walked would both be mapped to walk.
Stemming
Similar to lemmatization, stemming also reduces inflected terms to their base forms. The difference is that the part-of-speech tag is not used to determine the base form.
"Note on lemmatizing vs stemming: you might think, why would I ever need to use stemming? Wouldn’t this reduction step be more accurate if we knew the parts of speech? One clear advantage of stemming is that it is much faster. Another is that it eliminates the margin of error created by automatic parts of speech taggers."
Common NLP Tasks
Sentiment Analysis
The automated process of detecting sentiment from text. A common application of sentiment analysis is to determine whether a review is positive or negative, or whether a piece of text supports a particular viewpoint.
Machine Translation
This one is self-explanatory: all of your favourite automated translation tools are NLP applications.
Machine (Reading) Comprehension/Question Answering
Machine comprehension, usually carried out through question answering, is the task of automatically "understanding" text. It is typically tested through reading-comprehension questions: the input is a context document and a set of questions that can be answered using the document, and the model infers the answers from these inputs.
Named Entity Recognition (NER)
The automatic extraction of relevant entities (such as names, addresses and phone numbers) from an unstructured document.
Information Retrieval/Latent Semantic Indexing
The automatic retrieval of information from a large system (think web search engines). The problem is defined as returning the right document(s) when provided with a specific query.
Embeddings
Bag of Words
This is the simplest method of embedding words into numerical vectors. It is not often used in practice due to its oversimplification of language, but it is commonly found in examples and tutorials.
Consider these documents:
Document 1: high five
Document 2: I am old.
Document 3: She is five.
This gives us the vocabulary: high, five, I, am, old, she, is. For simplicity, we will ignore punctuation and normalize by converting to lower case. We can construct a matrix which represents the number of times each vocabulary term occurs in a document:

        Doc 1   Doc 2   Doc 3
high      1       0       0
five      1       0       1
i         0       1       0
am        0       1       0
old       0       1       0
she       0       0       1
is        0       0       1

This gives us the Bag of Words representation of each word and document. Move horizontally to get the word representation: high is [1, 0, 0]. Move vertically to get the document representation: document 1 is [1, 1, 0, 0, 0, 0, 0].
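The same representation can be produced with scikit-learn's CountVectorizer. A minimal sketch (assumes a recent scikit-learn; the custom token_pattern keeps one-letter tokens such as I, and the columns come out in alphabetical order rather than the order in the table above):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["high five", "I am old.", "She is five."]

# The custom token_pattern keeps one-letter tokens such as "I";
# punctuation is dropped and text is lower-cased by default.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # columns, in alphabetical order
print(counts.toarray())                    # one row of counts per document
```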
TF-IDF (term frequency — inverse document frequency)
Unlike Bag of Words, TF-IDF considers the relative importance of each term to each document. The vector representation of each term and document can be extracted in the same fashion as with Bag of Words.
The TF-IDF statistic for term i in document j is typically calculated as:

tf-idf(i, j) = tf(i, j) × log(N / df(i))

where tf(i, j) is the number of times term i occurs in document j, df(i) is the number of documents containing term i, and N is the total number of documents.
The document vectors can be used as features for a variety of machine learning models (SVM, Naive Bayes, Logistic Regression … etc).
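A minimal sketch with scikit-learn's TfidfVectorizer (assumes a recent scikit-learn; note that scikit-learn applies a smoothed variant of the idf formula above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["high five", "I am old.", "She is five."]
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))  # one weighted vector per document
```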
word2vec
Trained over large corpora, word2vec uses a shallow neural network to learn semantic and syntactic relationships from word co-occurrence. The result is a dense, high-dimensional feature space in which similar words sit close together. Embeddings such as these are well suited to recurrent neural networks, where word ordering is also taken into consideration.
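A minimal sketch of training a toy model with gensim (assumes gensim 4.x; a real model needs far more data than two sentences to learn useful vectors):

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [["the", "day", "is", "young"],
             ["the", "night", "is", "young"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)
print(model.wv["young"].shape)              # (50,)
print(model.wv.similarity("day", "night"))  # cosine similarity of two words
```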
Context Dependent Embeddings
word2vec embeddings are independent of context: each word is mapped to the same vector regardless of its surrounding words. Models such as BERT and ELMo instead create word embeddings that vary based on the context of the phrase.
Thank You