1. A Quick Example
Let’s look at an easy example to understand the concepts previously explained. We could be interested in analyzing the reviews about Game of Thrones:
Review 1: Game of Thrones is an amazing tv series!
Review 2: Game of Thrones is the best tv series!
Review 3: Game of Thrones is so great
In the table, I show all the calculations to obtain the Bag-Of-Words approach:
Each row corresponds to a different review, while the columns are the unique words contained in the three documents.
2. Implementation with Python
Let’s import the libraries and define the variables that contain the reviews:
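A minimal sketch of this setup, assuming the three reviews from the quick example are stored as plain strings. The variable names are illustrative, not taken from the original code.

```python
import string  # provides string.punctuation, used when cleaning the text

# The three reviews from the quick example (variable names are illustrative)
review1 = "Game of Thrones is an amazing tv series!"
review2 = "Game of Thrones is the best tv series!"
review3 = "Game of Thrones is so great"

reviews = [review1, review2, review3]
print(len(reviews), "reviews loaded")
```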
We need to remove punctuation, one of the steps I showed in the previous post about text pre-processing. We also transform each string into a list of words.
Next, we build the vocabulary, or wordset, which is composed of the unique words found in the three reviews.
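One way this cleaning step could look, as a sketch rather than the original code. The helper name `tokenize` is hypothetical, and lowercasing is an assumption added here so that "Game" and "game" count as the same word.

```python
import string

reviews = [
    "Game of Thrones is an amazing tv series!",
    "Game of Thrones is the best tv series!",
    "Game of Thrones is so great",
]

def tokenize(text):
    """Strip punctuation, lowercase (an assumption), and split on whitespace."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return cleaned.lower().split()

tokenized = [tokenize(review) for review in reviews]
print(tokenized[0])
# ['game', 'of', 'thrones', 'is', 'an', 'amazing', 'tv', 'series']
```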
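The vocabulary can be sketched as a set union over the tokenized reviews. The token lists below are written out by hand, assuming punctuation was removed and the text lowercased. Sorting is an extra choice made here to get a stable column order.

```python
# Tokenized reviews, assuming punctuation was stripped and text lowercased
tokenized = [
    ["game", "of", "thrones", "is", "an", "amazing", "tv", "series"],
    ["game", "of", "thrones", "is", "the", "best", "tv", "series"],
    ["game", "of", "thrones", "is", "so", "great"],
]

# The set union drops duplicates; sorting fixes the order of the words
vocabulary = sorted(set().union(*tokenized))
print(vocabulary)
# ['amazing', 'an', 'best', 'game', 'great', 'is', 'of', 'series', 'so', 'the', 'thrones', 'tv']
```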
We can finally define the function to extract the features in each document. Let’s explain step by step:
- we define a dictionary whose keys are the words of the vocabulary, each initialized to the value 0.
- we iterate over the words contained in the document and assign to each word its frequency within the review.
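The two steps above can be sketched as follows. The function name `bag_of_words` is hypothetical, and the vocabulary is hard-coded so the snippet runs on its own.

```python
vocabulary = ["amazing", "an", "best", "game", "great", "is",
              "of", "series", "so", "the", "thrones", "tv"]

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one tokenized document."""
    # Step 1: every vocabulary word starts with a count of 0
    counts = dict.fromkeys(vocabulary, 0)
    # Step 2: walk through the document and increment the matching counter;
    # every token is assumed to be in the vocabulary, which was built from the same documents
    for word in tokens:
        counts[word] += 1
    return counts

review1_tokens = ["game", "of", "thrones", "is", "an", "amazing", "tv", "series"]
features = bag_of_words(review1_tokens, vocabulary)
print(features["amazing"], features["best"])
# 1 0
```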
We can finally obtain the Bag-of-Words representations for the reviews. In the end, we obtain a data frame, where each row corresponds to the extracted features of each document.
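Putting the pieces together, a self-contained sketch of the whole pipeline might look like this. The helper names and the row labels are illustrative, and lowercasing is an assumption.

```python
import string

import pandas as pd

reviews = [
    "Game of Thrones is an amazing tv series!",
    "Game of Thrones is the best tv series!",
    "Game of Thrones is so great",
]

def tokenize(text):
    """Strip punctuation, lowercase (an assumption), and split on whitespace."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return cleaned.lower().split()

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one tokenized document."""
    counts = dict.fromkeys(vocabulary, 0)
    for word in tokens:
        counts[word] += 1
    return counts

tokenized = [tokenize(review) for review in reviews]
vocabulary = sorted(set().union(*tokenized))

# One row per review, one column per vocabulary word
bow = pd.DataFrame(
    [bag_of_words(tokens, vocabulary) for tokens in tokenized],
    index=["Review 1", "Review 2", "Review 3"],
)
print(bow)
```

Each cell holds the number of times the column's word occurs in the row's review, so the "tv" column reads 1, 1, 0.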
Didn’t it seem like one of those boring exercises given during a programming course? It’s like that, but applied to a real dataset. Great! We obtained what we wanted.
3. Comparison with Scikit-Learn
In the previous section, we implemented the representation. Now we want to compare the results with those obtained by applying Scikit-learn’s CountVectorizer. First, we instantiate a CountVectorizer object, and then we learn the term frequency of each word within each document. In the end, we return the document-term matrix.
CountVectorizer provides the get_feature_names_out method (get_feature_names in older Scikit-learn releases), which returns the unique words of the vocabulary. These words label the columns of the document-term matrix X. For easier visualization, we transform the matrix into a pandas data frame.
We compare it with the output obtained before.
So, the results match and the task is solved!
Until now I kept the stop words to keep the tutorial simple. But Scikit-learn can also remove the stop words without any extra lines of code. We only need to pass one more argument to the CountVectorizer constructor:
We can also do another experiment. One possibility is to take into account the bigrams instead of the unigrams. For example, the two words “tv series” go very well together and appear in the first two reviews:
Aren’t the combinations of words interesting? The bigram “tv series” makes sense, while the bigram “game thrones” loses its meaning because the word “of” was removed as a stop word. So, in some contexts, removing all the stop words isn’t always convenient.
That’s it! Bag-Of-Words is quite simple to implement, as you can see. Of course, we only considered unigrams (single words) and bigrams (pairs of words), but trigrams can also be taken into account to extract features. Stop words can be removed too, as we saw, but there are still some disadvantages. The order and the meaning of the words are lost using this method. For this reason, other approaches are preferred to extract features from the text, like TF-IDF, which I will talk about in the next post of the series. Thanks for reading. Have a nice day!
Eugenia Anello has a statistics background and is pursuing a master’s degree in Data Science at the University of Padova. She enjoys writing data science posts on Medium and on other platforms. Her purpose is to share the knowledge acquired in a simple and understandable way.
You can follow her on LinkedIn and Medium.