A script that automatically collects room sizes from apartment floorplans… or at least it tries

Image by the Author

If you have an analytical mindset and have ever gone on a virtual apartment-hunting trip, you have probably run into an issue I have always found infuriating: you can almost never search by property size, let alone by individual room sizes. However, this information is quite often available in the form of a floorplan image. It’s just a lot of zooming and manual typing to gather the data…

Well, I figured this would be an excellent opportunity to practice image recognition skills and create a script that can turn an image into a nice and clean data table. …


A step-by-step guide for beginners

Screenshot by Author

Shiny is an R package that lets you build interactive web apps. All you need is R; no HTML, CSS, or JavaScript is required, although you certainly have the option to enhance your app with them. You can run the app on your computer, host it on your own server, or use RStudio’s cloud service.

In this post, I am going to walk through the process of building a simple data analysis app from scratch. …


How to set up, verify, and use a custom environment in reinforcement learning training with Python

Photo by Omar Sotillo Franco on Unsplash

OpenAI’s Gym is (quoting their website) “… a toolkit for developing and comparing reinforcement learning algorithms”. It includes simulated environments, ranging from very simple games to complex physics-based engines, that you can use to train reinforcement learning algorithms. OpenAI’s other package, Baselines, comes with a number of algorithms, so training a reinforcement learning agent with these two libraries is really straightforward: it only takes a couple of lines of Python.

Of course, there is not much fun in only using the built-in elements: you either use the algorithms as benchmarks when you test your own algorithm, or you build a…


A 2-D visual representation of the principal components created from the Word2vec vectors of European capitals — a.k.a. a map.

Photo by Jakob Braun on Unsplash

Word2vec is definitely the most playful concept I’ve met during my Natural Language Processing studies so far. Imagine an algorithm that can convincingly mimic an understanding of the meanings of words and their functions in language, that can measure the closeness of words along hundreds of different topics, and that can answer more complicated questions like “who was to literature what Beethoven was to music”.

I thought it would be interesting to visually represent word2vec vectors: essentially, we can take the vectors of countries or cities, apply principal component analysis to reduce the dimensions, and put them on a…


Applying dimension-reduction techniques to convert TF-IDF vectors into more meaningful representations of H. P. Lovecraft’s stories.

Image by DarkWorkX from Pixabay

This is the fourth post in my ongoing series in which I apply different Natural Language Processing techniques to the writings of H. P. Lovecraft. For the previous posts in the series, see Part 1: Rule-based Sentiment Analysis, Part 2: Tokenisation, and Part 3: TF-IDF Vectors.

This post builds heavily on the concept of TF-IDF vectors: a vector representation of a document, based on the relative importance of individual words in the document and in the whole corpus. As a next step, we are going to transform those vectors into a lower-dimensional representation using Latent Semantic Analysis (LSA). …


Building TF-IDF representations of H. P. Lovecraft’s stories using Python, scikit-learn, and spaCy in order to determine which stories are close to each other.

Image by F-P from Pixabay

This is the third part of my Lovecraft NLP series. In the previous posts, I discussed rule-based sentiment analysis and word counts with tokenisation.

Our approach has been quite simplistic so far in the series: we basically broke the text down into words and counted them in some way. The next step in the world of NLP is going to be looking at TF-IDF vectors, where TF-IDF stands for Term Frequency-Inverse Document Frequency. …
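
To make the idea concrete, here is a minimal sketch of one common TF-IDF variant in plain Python. It is an illustration under stated assumptions rather than the code used in the post: the `tf_idf` helper and the three toy sentences are hypothetical, term frequency is taken as the word count divided by the document length, and inverse document frequency is the unsmoothed log(N / df). scikit-learn's `TfidfVectorizer` uses a smoothed and normalised variant by default, so its numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy corpus, tokenised by splitting on whitespace.
docs = [
    "the old house on the hill".split(),
    "the sea and the old city".split(),
    "the strange city under the sea".split(),
]
vectors = tf_idf(docs)
```

In this variant, a word that occurs in every document (here "the") gets a weight of zero, while a word unique to one document (such as "house") gets the highest weight for its frequency.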


Using spaCy, a Python NLP library, to analyse word usage in H. P. Lovecraft’s stories.

Image by LUM3N from Pixabay

This is the second blog post in my series in which I analyse the works of H. P. Lovecraft through the lens of Natural Language Processing. In the first post, we kicked things off by analysing the overall sentiment of the stories. We can’t postpone this any longer: we have to talk about the basis of all NLP analysis, tokenisation. I thought it would be fun to pair this with a particular question, so in this post, we are going to find which words Lovecraft used the most in each story, and across his literary work as a whole.

One of the…


Using VADER, a rule-based sentiment analysis library in Python, to rank H. P. Lovecraft stories from darkest to… least dark.

Image by F-P from Pixabay

I’ve been considering doing a Natural Language Processing project for a while now, and I finally decided to do a comprehensive analysis of a corpus taken from literature. I think classical literature is a really interesting application of NLP: you can showcase a wide array of topics, from word counts and sentiment analysis to neural network text generation.

I picked H. P. Lovecraft’s stories as the subject of my analysis. He was an American writer from the early 20th century, famous for his weird and cosmic horror fiction, with a huge influence on modern pop culture. …


How to split, save, and extract text from PDF files using PyPDF2 and PDFMiner, demonstrated with the complete works of H. P. Lovecraft.

Photo by Aleksandar Pasaric from Pexels

I don’t think there is much room for creativity when it comes to writing the intro paragraph for a post about extracting text from a PDF file. There is a PDF, there is text in it, we want the text out, and I am going to show you how to do that using Python.

In the first part, we are going to have a look at two Python libraries, PyPDF2 and PDFMiner. As their names suggest, they are libraries written specifically to work with PDF files. We will discuss the different classes and methods we need.

Then, in the second…


A step-by-step guide on implementing a latent factor recommendation engine using the Surprise library in Python.

Photo by Deva Williamson on Unsplash

This post is the last piece of my Python Surprise recommendation series in which I present the techniques I used in my boardgame recommendation engine project. (See my GitHub repo for the whole project.)

Previous posts in the series:

Part 1: How to Build a Memory-Based Recommendation System using Python Surprise: I recommend reading this post first if you are not familiar with the topic, especially the Data Import and Data Preparation steps, since they are identical for the memory-based and model-based approaches.

Part 2: My Python Code for Flexible Recommendations: This post contains the additional code I wrote for…

Mate Pocs

Writing about Data Science.
