Natural Language Processing
Ability of computers to process and analyse large amounts of natural language data
- NLP is a combination of Computer Science, Artificial Intelligence and Human Language
- N-grams play an important role in NLP
An NLP Pipeline consists of:
- Data Acquisition
- Text Preparation
- Feature Engineering
- Modelling
Data Acquisition
Process of acquiring data from different sources
Text Preparation
Working on data to make it representable / processable
- Cleaning
  - HTML tag removal
  - Removing emojis
  - Spell checking
- Basic Preprocessing (see the sketch after this list)
  - Tokenisation (sentence or word)
  - Stop word removal
  - Stemming / Lemmatisation
  - Lower case conversion
  - Language detection
- Advanced Preprocessing
  - POS tagging
  - Parsing
  - Coreference resolution
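A minimal sketch of the basic preprocessing steps listed above, assuming NLTK is installed and its `punkt`, `stopwords` and `wordnet` resources have already been downloaded:

```python
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Assumes: nltk.download("punkt"), nltk.download("stopwords"), nltk.download("wordnet")
text = "The cats are sitting on the mats."

# Lower case conversion and word tokenisation
tokens = word_tokenize(text.lower())

# Stop word removal (keep only alphabetic, non-stopword tokens)
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]

# Lemmatisation
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens]

print(lemmas)  # ['cat', 'sitting', 'mat']
```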
Embedding / Vectorisation / Feature Engineering
Converting words and sentences into numeric vectors for use with computational algorithms
- The process of representing a token/a sentence as a numerical vector
- It uniquely identifies a word
- It makes it possible to apply arithmetic operations
Some types of Embeddings are:
- One Hot Encoding
- Bag of Words (BOW)
- Term Frequency-Inverse Document Frequency (TF-IDF)
- Word2Vec
- GloVe
- FastText
- Transformers
Embeddings always have defined dimensions
- Sparse and dense vectors are the two categories of word representations
- Sparse embedding techniques are highly dependent on the corpus, particularly the vocabulary size
- Dense embedding techniques, on the other hand, are not entirely dependent on the corpus and are more flexible
One Hot Encoding
Every word is represented as a unique ‘One-Hot’ binary vector of 0s and 1s
For every unique word in the vocabulary, the vector contains a single 1 and all other values are 0; the position of the 1 uniquely identifies the word.
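A minimal sketch of one-hot encoding over a small, hypothetical vocabulary, in plain Python:

```python
# Hypothetical toy vocabulary; in practice this is built from the corpus
vocab = ["car", "truck", "road", "highway", "driven"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a binary vector with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("road"))  # [0, 0, 1, 0, 0]
```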
Bag of Words (BOW)
A text (such as a sentence or a document) is represented as the bag (multi-set) of its words, disregarding grammar and even word order but keeping multiplicity.
- Measure of the presence of known words
- Vocabulary of known words
(Worked example not shown: a table of reviews, the vocabulary of unique words, and the BOW vector for the Review 4 document.)
After converting the documents into such vectors, we can compare different sentences by calculating the Euclidean distance between them to check whether two sentences are similar. If there are no common words, the distance will be much larger, and vice versa.
BOW does not work very well when there are small changes in terminology: sentences with similar meanings but different words end up with very different vectors.
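A minimal BOW sketch using scikit-learn's `CountVectorizer` on a few made-up review sentences (the mini-corpus here is an assumption, not the original example):

```python
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import euclidean

# Hypothetical mini-corpus of reviews
reviews = [
    "the food was good",
    "the food was not good",
    "the service was great",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(reviews).toarray()

print(vectorizer.get_feature_names_out())  # vocabulary of known words
print(bow)                                 # one count vector per review

# Euclidean distance between the first two reviews
print(euclidean(bow[0], bow[1]))
```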
Term Frequency-Inverse Document Frequency (TF-IDF)
Generates Sparse Vectors / Embeddings
TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus
The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word.
It is computed as follows:
- Count the number of times each word appears in a document
- Calculate the frequency that each word appears in a document out of all the words in the document
Term Frequency
Term frequency (TF) shows how frequently an expression (term, word) occurs in a document
Inverse Document Frequency
IDF is a measure of how much information the word provides, i.e., if it’s common or rare across all documents
It is the logarithmically scaled inverse fraction of the documents that contain the word
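In symbols, one common formulation is (a sketch; libraries differ slightly in the exact weighting):

$$\mathrm{tf}(t,d)=\frac{f_{t,d}}{\sum_{t'} f_{t',d}},\qquad \mathrm{idf}(t,D)=\log\frac{N}{|\{d\in D: t\in d\}|},\qquad \text{tf-idf}(t,d,D)=\mathrm{tf}(t,d)\cdot\mathrm{idf}(t,D)$$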
TF-IDF gives larger values to words that are less frequent across the document corpus.
The TF-IDF value is high when both TF and IDF are high, i.e. the word is rare across the whole corpus but frequent within a particular document.
Example
Sentence 1: The car is driven on the road
Sentence 2: The truck is driven on the highway
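A minimal sketch computing TF-IDF for these two sentences with scikit-learn's `TfidfVectorizer` (note that scikit-learn uses a smoothed IDF, so the exact values differ slightly from the formula above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "The car is driven on the road",
    "The truck is driven on the highway",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(sentences)

# Remaining vocabulary: ['car', 'driven', 'highway', 'road', 'truck']
# 'driven' appears in both sentences, so it gets a lower weight
# than 'car' and 'road', which are unique to the first sentence.
for word, score in zip(vectorizer.get_feature_names_out(), tfidf.toarray()[0]):
    print(f"{word:>8}: {score:.3f}")
```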
Shortcomings
- Can lead to sparse vectors, which ultimately adds to our computational cost
Positive Pointwise Mutual Information (PPMI)
Generates Sparse Vectors / Embeddings
- Derives an association score (information gain) from the probabilities of the individual words and of their co-occurrence
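As a sketch, the standard PMI / PPMI definitions (the PPMI variant clips negative values to zero):

$$\mathrm{PMI}(w,c)=\log_2\frac{P(w,c)}{P(w)\,P(c)},\qquad \mathrm{PPMI}(w,c)=\max\big(\mathrm{PMI}(w,c),\,0\big)$$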
Word2Vec
Generates Dense Vectors / Embeddings
A vectorisation technique that learns embeddings through a prediction task
CBOW
Continuous Bag of Words
We train a Neural Network using the surrounding words of a word within a defined context window
CBOW is used when the corpus is very large
- Context Words
  - Words surrounding the actual word we need
  - Given as input
- Target Word
  - Word we need to predict and work on
  - Given as output
- Self-supervised learning algorithm
- We slide a window over the text (see the sketch after this list)
- Each vocabulary word is made the target one by one, while the words in its context window are fed as input
- The underlying neural network is also called a black box: we cannot interpret the meanings of its weights, nor can we assign embeddings / weights on our own
- The weights learned by the model at the final / output layer are our word embeddings; each word has its own weights and therefore its own embedding
- The number of weights for each word is also our embedding size
- Architecture (diagram not shown)
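A minimal sketch (plain Python, toy sentence assumed) of how the sliding window produces (context, target) training pairs for CBOW; Skip-gram uses the same pairs with input and output swapped:

```python
def cbow_pairs(tokens, window=2):
    """Generate (context_words, target_word) pairs using a sliding window."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
        pairs.append((context, target))
    return pairs

tokens = ["the", "car", "is", "driven", "on", "the", "road"]
for context, target in cbow_pairs(tokens):
    print(context, "->", target)
# e.g. ['the', 'car', 'driven', 'on'] -> is
```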
Skip Gram
Skip Gram is used when the corpus is small
- Target Word
  - Word for which we need to predict the embedding and work on
  - Given as input
- Context Words
  - Words surrounding the actual word we need
  - Given as output
- Self-supervised learning algorithm
- We use the same sliding window as in CBOW
- A neural network is again used as the underlying model to learn the embeddings
  - However, the architecture of the network is reversed compared to CBOW
- Negative sampling is used to feed the words and data into the network (see the sketch after this list)
- Architecture (diagram not shown)
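A minimal Skip-gram sketch using gensim's `Word2Vec` (toy corpus assumed; `sg=1` selects Skip-gram, `sg=0` would select CBOW, and `negative=5` enables negative sampling):

```python
from gensim.models import Word2Vec

# Hypothetical toy corpus: a list of tokenised sentences
sentences = [
    ["the", "car", "is", "driven", "on", "the", "road"],
    ["the", "truck", "is", "driven", "on", "the", "highway"],
]

# sg=1 -> Skip-gram: predict context words from the target word
# negative=5 -> negative sampling with 5 noise words per positive pair
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=1, negative=5, epochs=50)

print(model.wv["truck"][:5])              # first 5 dimensions of the embedding
print(model.wv.similarity("car", "truck"))
```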
Language Modelling
A language model learns to predict the probability of a sequence of words.
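A minimal sketch of a bigram (n-gram) language model over a tiny, assumed toy corpus; real language models are trained on far larger data:

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus
corpus = "the car is driven on the road . the truck is driven on the highway .".split()

# Count bigrams: how often each word follows another
bigram_counts = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    bigram_counts[w1][w2] += 1

def bigram_prob(w1, w2):
    """P(w2 | w1) estimated by maximum likelihood from the bigram counts."""
    total = sum(bigram_counts[w1].values())
    return bigram_counts[w1][w2] / total if total else 0.0

# Probability of a short sequence as a product of bigram probabilities
sequence = ["the", "car", "is", "driven"]
prob = 1.0
for w1, w2 in zip(sequence, sequence[1:]):
    prob *= bigram_prob(w1, w2)
print(prob)  # 0.25 for this toy corpus
```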