Word2Vec is a method for creating word embeddings. Word embedding is a feature engineering technique used in NLP where words or phrases from a vocabulary are mapped to vectors of real numbers. Word2Vec has two different methods: Continuous Bag of Words (CBOW) and Skip-gram. We can use either of them to create our word embeddings.
CBOW: predicting the current word given its context/surrounding words.
Skip-gram: predicting the context/surrounding words given the current word.
Figure: CBOW and Skip-gram model architectures (see the Word2Vec paper: https://arxiv.org/pdf/1301.3781.pdf).
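As a quick illustration (not from the original post): with the gensim library, the `sg` parameter switches between the two methods. This is a minimal sketch assuming gensim 4.x, and the toy corpus is placeholder data.

```python
# Minimal sketch using gensim 4.x (assumed available; older versions
# use `size` instead of `vector_size`). The corpus is placeholder data.
from gensim.models import Word2Vec

sentences = [
    ["ashraf", "marwan", "remembered", "famously", "spying"],
    ["egyptian", "intelligence", "agency"],
]

# sg=0 -> CBOW (predict the word from its context);
# sg=1 -> skip-gram (predict the context from the word).
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(skipgram.wv["marwan"][:5])  # first five dimensions of one embedding
```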
Skip-gram model: In skip-gram, the current word is taken as the input, and the model predicts the words within a certain range before and after it.
Example: Ashraf Marwan is remembered most famously for spying for the Egyptian intelligence agency.
After removing the stop words and lowercasing everything:
ashraf marwan remembered famously spying egyptian intelligence agency
If we set that range to 2 (also called the WINDOW_SIZE), each word is paired with every word up to two positions before and after it. For example, the input word "remembered" produces the training pairs (remembered, ashraf), (remembered, marwan), (remembered, famously), and (remembered, spying).
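A small sketch (my own illustration of the post's example) that enumerates the skip-gram (input, target) pairs for this sentence with WINDOW_SIZE = 2:

```python
# Enumerate (input, target) skip-gram pairs for a window size of 2.
tokens = "ashraf marwan remembered famously spying egyptian intelligence agency".split()
WINDOW_SIZE = 2

pairs = []
for i, center in enumerate(tokens):
    # Pair the current word with every word up to WINDOW_SIZE positions away.
    for j in range(max(0, i - WINDOW_SIZE), min(len(tokens), i + WINDOW_SIZE + 1)):
        if j != i:
            pairs.append((center, tokens[j]))

for input_word, target_word in pairs[:6]:
    print(input_word, "->", target_word)
# ashraf -> marwan
# ashraf -> remembered
# marwan -> ashraf
# marwan -> remembered
# marwan -> famously
# remembered -> ashraf
```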
There are three main steps for training word2vec (a minimal sketch of all three follows the list):
- Clean the text documents (pre-processing).
- Convert the input and target words into one-hot vectors using the vocabulary.
- Create a shallow neural network and train it.
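As a rough, self-contained sketch of steps 2 and 3 (my own simplification in NumPy, not the code from the linked post), the one-hot encoding and shallow network could look like this:

```python
import numpy as np

# Step 1 output: the cleaned tokens from the example above.
tokens = "ashraf marwan remembered famously spying egyptian intelligence agency".split()
vocab = sorted(set(tokens))
word_to_idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)       # vocabulary size
EMBED_DIM = 10       # hidden-layer width = embedding dimension (chosen arbitrarily)

# Skip-gram (input, target) pairs with WINDOW_SIZE = 2, as above.
WINDOW_SIZE = 2
pairs = [(tokens[i], tokens[j])
         for i in range(len(tokens))
         for j in range(max(0, i - WINDOW_SIZE), min(len(tokens), i + WINDOW_SIZE + 1))
         if j != i]

def one_hot(word):
    # Step 2: convert a word into a one-hot vector using the vocabulary.
    v = np.zeros(V)
    v[word_to_idx[word]] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Step 3: a shallow network, input -> hidden (W1) -> output (W2) -> softmax.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(V, EMBED_DIM))
W2 = rng.normal(scale=0.1, size=(EMBED_DIM, V))
lr = 0.05

for epoch in range(100):
    for input_word, target_word in pairs:
        x = one_hot(input_word)
        y = one_hot(target_word)
        h = x @ W1                  # hidden layer = embedding of the input word
        y_hat = softmax(h @ W2)     # predicted distribution over the vocabulary
        err = y_hat - y             # gradient of cross-entropy w.r.t. the logits
        dh = err @ W2.T             # backpropagate to the hidden layer first
        W2 -= lr * np.outer(h, err)
        W1 -= lr * np.outer(x, dh)

# After training, row i of W1 is the learned embedding of vocab[i].
print(W1[word_to_idx["marwan"]])
```

Note that this sketch uses a full softmax over the vocabulary for simplicity; real Word2Vec implementations use tricks such as negative sampling or hierarchical softmax to make training tractable on large vocabularies.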
All the above steps and code are explained here: https://aquibjk.wordpress.com/2018/10/03/word2vec-analysis-and-implementation/