# Vectorizer Archives

In information retrieval and text mining, TF-IDF, short for term-frequency inverse-document frequency is a numerical statistics (a weight) that is intended to reflect how important a word is to a document in a collection or corpus. It is based on frequency.

The TF-IDF is a product of two statistics term: tern frequency and inverse document frequency. There are various ways for determining the exact values of both statistics.

Before jumping to TF-IDF, let’s first understand Bag-of-Words (BoW) model

BoW model creates a vocabulary extracting the unique words from document and keeps the vector with the term frequency of the particular word in the corresponding document. Simply term frequency refers to number of occurences of a particular word in a document. BoW is different from Word2vec. The main difference is that Word2vec produces one vector per word, whereas BoW produces one number (a wordcount).

The Paint Synthesizer – Image Operations – Vectorizer Modes in Studio Artist This video is going to concentrate on these three Studio Artist modes, The Paint Synthesizer, The Vectorizer and Image Operations. Let’s start with the Paint Synthesizer. This is the Painting Mode in Studio Artist and like all modes, The Paint Synthesizer has. Super Vectorizer is a professional vector-trace tool that enables conversion from a raster (bitmap) image–such as JPEG, BMP, or PNG–to a scalable vector graphic with a. September 28, 2016. Super Vectorizer 1.6.6 – Professional vector trace tool.

For example:

- John likes to watch movies. Mary likes movies too.
- Mary also likes to watch football games.

Based on these two documents, we can construct a list for each document:

`['John', 'likes', 'to', 'watch' movies', 'Mary', 'likes', 'movies', 'too']`

`['Mary', 'also', 'likes', 'to', 'watch', 'football', 'games']`

Representing each bag-of-words as a `JSON`

object, we get:

and

where each key is the word and each value is the number of occurences of that word in the given text documents.

The drawbacks of Bag-of-Words model are:

- The term ordering is not considered
- The rareness of term is not considered
- It results in extremely large feature dimensions and sparse vectors.

We will use `CountVectorizer`

of Sci-kit Learn to convert a collection of text documents to a matrix of token counts:

The default tokenization in `CountVectorizer`

removes all special characters, punctuation and single characters. If this is not the behavior you desire, and you want to keep punctuation and special characters, you can provide a custom tokenizer to `CountVectorizer`

.

You can also use a custom stop word list that you provide, which we will see an example below!

Notice that here we have 11 unique words. So we have 11 columns. Each column in the matrix represents a unique word in the vocabulary, while each row represents the document in our dataset.

#### How to provide stop words in a list?

### Vectorizer Archives 2019

In this example, we provide a list of words that act as our stop words. Notice that the shape has gone from (2,11) to (2,9) because of the stop words that were removed, `['to', 'too']`

. Note that we can actually load stop words directly from a file into a list and supply that as the stop word list. One of the lists is given in https://github.com/kavgan/nlp-in-practice/blob/master/tf-idf/resources/stopwords.txt

we can also use `min_df`

and `max_df`

arguments to get rid of some words.

we can ignore words that are too common with `max_df`

. `max_df`

looks at how many documents contained a term, and if it exceeds the `max_df`

threshold, then it is eliminated from consideration. The MAX_DF value can be an absolute value (e.g. 1, 2, 3, 4) or a value representing proportion of documents (e.g. 0.85 meaning, ignore words appeared in 85% of the documents as they are too common).

#### BoW model by hand

For term frequency in a document $tf(t, d)$, the simplest choice is to use the raw count of a term in a document, i.e., the number of times that a term $t$ occurs in a document $d$. If we denote the raw count by $f_{t, d}$, the simplest tf scheme is $tf(t, d) = f_{t, d}$. Other possibilities:

Binary: $tf(t, d) = 1$ if $t$ occurs in $d$ and 0, otherwise.

Term frequency is adjusted for document length: begin{equation} frac{f_{t,d}}{sum_{t^{‘} in d} f_{t^{‘}, d}} end{equation} where the denominator is total number of words (terms) in the document $d$.

Logarithmically scaled frequency: begin{equation} tf(t,d) = log(1 + f_{t, d}) end{equation}

Augmented frequence, to prevent a bias towards longer documents, e.g., raw frequency divided by the raw frequency of the most occuring term in the document begin{equation} tf(t, d) = 0.5 + 0.5 frac{f_{t, d}}{max { f_{t^{‘}, d}: t^{‘} in d }} end{equation} This formulation is also called double normalization 0.5. If 0.5 is another number K instead of 0.5, it is called double normalization K.

The inverse-document frequency is a measure of how much information the word provides, i.e., if it is a common or rare across all the documents. It determines the weight of rare words across all documents in the corpus. For example, the most commonly used word in english language is “the” which represents 7% of all words written or spoken. You could not deduce anything about a text given the fact that it containts the word “the”. On the other hand, words like “good” and “awesome” could be used to determine whether a rating is positive or not. Inverse-document frequency is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient).

begin{equation}idf(t, D) = log left(frac{N}{mid {d in D: t in d } mid} right)end{equation}

where $N$ is the total number of documents in the corpus, $N = mid D mid$ and $mid {d in D: t in d } mid$ is the number of documents where the term $t$ appears (i.e., $tf(t, d) neq 0$). If the term is not in the corpus, this will lead to division by zero. It is therefore common to adjust the denominator to $left( 1 + mid {d in D: t in d } mid right)$.

Then tf-idf is calculated as

begin{equation}tf-idf (t, d, D) = tf(t, d) times idf(t, D)end{equation}

A high weight in tf-idf is reached by a high term frequency in the given document and a low document frequency of a term in the whole collection of documents, the weights hence tend to filter out common terms. Since the ratio inside the idf’s log function is always greater than or equal to 0, the value idf (and tf-idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches to , bringing idf and thus, tf-idf closer to 0.

Each word or term has its respective tf-idf score. Putting simply, the higher the tf-idf score (weight), the rarer the term and vice versa.

This example can be found in Wikipedia page of this subject:

Suppose that we have term count tables of a corpus consisting of only two documents:

The calculation of tf–idf for the term “this” is performed as follows:

[begin{split}tf ( this, d_{1}) &= frac {1}{5} = 0.2tf (this, d_{2}) &= frac {1}{7} approx 0.14idf (this, D) &= log left({frac {2}{2}}right)=0end{split}]So tf–idf is zero for the word “this”, which implies that the word is not very informative as it appears in all documents.

[begin{split}tf-idf (this, d_{1}, D) &= 0.2 times 0 = 0 tf-idf (this, d_{2}, D) &= 0.14 times 0 = 0end{split}]The word “example” is more interesting - it occurs three times, but only in the second document:

[begin{split}tf (example, d_{1}) &= frac {0}{5} = 0tf (example, d_{2}) &= frac {3}{7} approx 0.429idf (example, D) &= log left( frac {2}{1} right) = 0.301end{split}]Finally,

[begin{split}tf-idf (example, d_{1}, D) &= tf (example, d_{1}) times idf (example, D) = 0 times 0.301=0tf-idf (example, d_{2}, D) &= tf (example, d_{2}) times idf (example, D) = 0.429 times 0.301 approx 0.129end{split}](using the base 10 logarithm).

Below we have 5 toy documents. We are going to use this toy dataset to compute the tf-idf scores of words in these documents.

In order to start using `TfidfTransformer`

you will first have to create a `CountVectorizer`

to count the number of words (term frequency), limit your vocabulary size, apply stop words and etc.

Now it’s time to compute the IDFs. Note that in this example, we are using all the defaults with `CountVectorizer`

. You can actually specify a custom stop word list, enforce minimum word count, etc… Now we are going to compute the IDF values by calling `tfidf_transformer.fit(word_count_vector)`

on the word counts we computed earlier.

Note that the idf formula above differs from the standard textbook notation that defines the idf as `idf(t) = log [ n / (df(t) + 1) ])`

where n is the total number of documents in the document set and df(t) is the document frequency of t; the document frequency is the number of documents in the document set that contain the term t.

If `smooth_idf=True`

(the default), the constant “1” is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: `idf(d, t) = log [ (1 + n) / (1 + df(d, t)) ] + 1`

.

For example, the term `cat`

appears in two documents and we have 5 documents. In other words, `n=5`

and `df('cat') = 2`

. Thefore, inverse-document frequency for this word is:

Notice that the words ‘mouse’ and ‘the’ have the lowest IDF values. This is expected as these words appear in each and every document in our collection. The lower the IDF value of a word, the less unique it is to any particular document.

Once you have the IDF values, you can now compute the tf-idf scores for any document or set of documents. Let’s compute tf-idf scores for the 5 documents in our collection.

The first line above, gets the word counts for the documents in a sparse matrix form. We could have actually used `word_count_vector`

from above. However, in practice, you may be computing tf-idf scores on a set of new unseen documents. When you do that, you will first have to do `cv.transform(your_new_docs)`

to generate the matrix of word counts.

Then, by invoking `tfidf_transformer.transform(count_vector)`

you will finally be computing the tf-idf scores for your docs. Internally this is computing the `tf * idf`

multiplication where your term frequency is weighted by its IDF values.

Now, let’s print the tf-idf values of the first document to see, by placing the tf-idf scores from the first document into a pandas data frame and sorting it in descending order of scores.

Notice that only certain words have scores because they only appear in the first document. The first document is “the house had a tiny little mouse” all the words in this document have a tf-idf score and everything else show up as zeroes. Notice that the word “a” is missing from this list. This is possibly due to internal pre-processing of `CountVectorizer`

where it removes single characters. Note that the more common the word across documents, the lower its score and the more unique a word is to our first document (e.g. ‘had’ and ‘tiny’) the higher the score.

We can repeat the same process for all other documents.

### Tfidfvectorizer Usage

### Vectorizer Archives Free

This is another way of computing TF-IDF weights for terms in the document.

With `Tfidfvectorizer`

you compute the word counts, idf and tf-idf values all at once. It’s really simple.

### Vectorizer Archives -

Here’s another way to do it by calling `fit`

and `transform`

separately and you’ll end up with the same results.