Photo by Scott Graham on Unsplash

A common task during claim frequency modelling in an insurance setting is handling the different levels of exposure. Exposure is a broad term; Insurancepedia defines it as “susceptibility to losses or risks”, but in our example, it will be analogous to time.

We expect a higher claim frequency if an entity is subject to a certain risk for a year than if it were subject to that risk for only a week. So if we have historical claim information, but not all the observed policies lasted a full year, the varying exposure has to be taken into account during the modelling. …
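A standard way to do that is to let the exposure enter a Poisson frequency model as an offset. The sketch below is a minimal, hypothetical example assuming statsmodels and a toy DataFrame with claim_count, age_group, and exposure columns; it is not the exact setup from the full post.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical policy-level data: claim counts observed over varying exposure periods.
policies = pd.DataFrame({
    "claim_count": [0, 1, 0, 2, 0, 1],
    "age_group": ["young", "young", "mid", "mid", "old", "old"],
    "exposure": [1.0, 0.5, 1.0, 0.25, 1.0, 0.75],  # fraction of a year observed
})

# Poisson GLM for claim frequency; log(exposure) enters as an offset,
# so the coefficients describe the claim rate per full year of exposure.
model = smf.glm(
    "claim_count ~ age_group",
    data=policies,
    family=sm.families.Poisson(),
    offset=np.log(policies["exposure"]),
).fit()

print(model.summary())
```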


Photo by Kevin Jarrett on Unsplash

Manually setting the seeds of a (pseudo) random generator is a double-edged sword:

  • it does mean that running the same script will always yield the same results, thus making the experiment reproducible;
  • but it might also take away more of the randomness than you intended, thus making the experiment potentially useless (a quick illustration of both sides follows below).
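Here is a minimal toy sketch of both sides, using NumPy’s legacy global seed (my own example, not the code from the story that follows):

```python
import numpy as np

# Reproducibility: the same seed always produces the same draws.
np.random.seed(42)
print(np.random.rand(3))   # identical output on every run of the script

# The flip side: re-seeding inside a loop strips away randomness you probably wanted.
samples = []
for _ in range(3):
    np.random.seed(42)                 # every iteration restarts the same "random" stream
    samples.append(np.random.rand())
print(samples)                         # three identical values, not three random ones
```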

The following happened to a … ahem … friend of mine a while ago. In retrospect, it’s a really obvious mistake, but I rarely see it discussed, so I’m pretty sure other people can run into this problem too.

Let’s see what my friend did wrong…


Photo by DJ Johnson on Unsplash

I still remember the euphoria I felt when I first dived headfirst into the world of Data Science. You mean to tell me that all this information is just out there, available 24/7, completely free? From tutorials to code bits and complete packages, written by like-minded individuals, covering every possible tiny niche area of this vast discipline? Yes, please!

However, I soon realised that free information is not as great as it seems: the quality of writing varies greatly, and online data science guides are simply not to be trusted.

Hang on, is this going to be a long and…


Photo by Emily Morter on Unsplash

I’m a big advocate of using simple models, whenever the problem allows them. It’s all fun and games with the fancy neural networks, but if the good old linear regression gets the job done, why not use it?

However, precisely because linear regression is a seemingly simple concept, there are two issues with it:

  • the theory is often underestimated — you assume you know how things work just because you have already used them hundreds of times;
  • the model is very accessible, so there is a lot of material on the topic, everyone has their opinion, and it’s super easy for false…



Image by the Author

If you have an analytical mindset and have ever gone on a virtual apartment-hunting trip, you have probably run into an issue I have personally always found infuriating: you can almost never search by property size, let alone by individual room sizes. However, this information is quite often available in the form of a floorplan image. It’s just a lot of zooming and manual typing to gather the data…

Well, I figured this would be an excellent opportunity to practise my image recognition skills and create a script that can turn an image into a nice, clean data table. …
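As a rough, hypothetical illustration of the idea (assuming the pytesseract and OpenCV packages, and a file name of my own choosing; this is not the full script from the post), the core of such a pipeline is reading the image, running OCR, and parsing room names and dimensions out of the recognised text:

```python
import re
import cv2
import pytesseract

# Hypothetical input file; any floorplan image with readable room labels would do.
image = cv2.imread("floorplan.png")

# Basic preprocessing usually helps OCR: greyscale plus a simple threshold.
grey = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(grey, 150, 255, cv2.THRESH_BINARY)

# Extract all recognisable text from the image.
text = pytesseract.image_to_string(binary)

# Parse lines such as "Bedroom 3.20 x 2.80" into (room, width, length) records.
rooms = re.findall(r"([A-Za-z ]+)\s+(\d+\.\d+)\s*x\s*(\d+\.\d+)", text)
for name, width, length in rooms:
    print(name.strip(), round(float(width) * float(length), 2), "sq m")
```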


Screenshot by Author

Shiny is an R package that lets you build interactive web apps. All you need is R, no HTML, CSS, or JavaScript — although you certainly have the option to enhance your app with them. You can run the app on your computer, host it on your own server, or use RStudio’s cloud service.

In this post, I am going to walk through the process of building a simple data analysis app from scratch. …


Photo by Omar Sotillo Franco on Unsplash

OpenAI’s Gym is (citing their website) “… a toolkit for developing and comparing reinforcement learning algorithms”. It includes simulated environments, ranging from very simple games to complex physics-based engines, that you can use to train reinforcement learning algorithms. OpenAI’s other package, Baselines, comes with a number of algorithms, so training a reinforcement learning agent is really straightforward with these two libraries; it only takes a couple of lines in Python.
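For a flavour of how little code it takes, here is a minimal sketch using the classic Gym API with a random policy standing in as a placeholder agent (the Baselines training call itself depends on which algorithm you pick, so it is left out here):

```python
import gym

# Classic (pre-0.26) Gym API; newer versions return extra values from reset() and step().
env = gym.make("CartPole-v1")
observation = env.reset()

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()                       # random placeholder policy
    observation, reward, done, info = env.step(action)
    total_reward += reward

print("Episode reward:", total_reward)
env.close()
```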

Of course, there is not much fun in only using the built-in elements, you either use the algorithms as benchmarks when you test your own algorithm, or you build a…


Photo by Jakob Braun on Unsplash

Word2vec is definitely the most playful concept I’ve come across during my Natural Language Processing studies so far. Imagine an algorithm that can convincingly mimic an understanding of the meanings of words and their functions in language, that can measure the closeness of words along hundreds of different topics, and that can answer more complicated questions like “who was to literature what Beethoven was to music”.

I thought it would be interesting to visually represent word2vec vectors: essentially, we can take the vectors of countries or cities, apply principal component analysis to reduce the dimensions, and put them on a…
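A minimal sketch of that idea, assuming gensim’s downloadable pre-trained word2vec vectors, scikit-learn, and matplotlib (the city list is just an example, not the one from the post):

```python
import gensim.downloader as api
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Pre-trained Google News word2vec vectors (a large download on first use).
vectors = api.load("word2vec-google-news-300")

cities = ["Paris", "London", "Berlin", "Madrid", "Rome", "Tokyo", "Beijing", "Moscow"]
city_vectors = [vectors[city] for city in cities]

# Reduce the 300-dimensional vectors to 2 dimensions for plotting.
coords = PCA(n_components=2).fit_transform(city_vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), city in zip(coords, cities):
    plt.annotate(city, (x, y))
plt.show()
```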


Image by DarkWorkX from Pixabay

This is the fourth post in my ongoing series in which I apply different Natural Language Processing techniques to the writings of H. P. Lovecraft. For the previous posts in the series, see Part 1 — Rule-based Sentiment Analysis, Part 2 — Tokenisation, Part 3 — TF-IDF Vectors.

This post builds heavily on the concept of TF-IDF vectors, a vector representation of a document based on the relative importance of individual words in the document and in the whole corpus. As a next step, we are going to transform those vectors into a lower-dimensional representation using Latent Semantic Analysis (LSA). …
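In scikit-learn terms, that step is essentially a truncated SVD applied to the TF-IDF matrix. A minimal sketch with a toy corpus (not the Lovecraft texts, nor the exact parameters from the post) could look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus standing in for the Lovecraft texts.
documents = [
    "the ancient city lay silent beneath the waves",
    "strange dreams of a sunken city haunted him",
    "the old manuscript described rituals and dreams",
]

# Step 1: TF-IDF vectors, one sparse row per document.
tfidf_matrix = TfidfVectorizer().fit_transform(documents)

# Step 2: LSA, i.e. truncated SVD on the TF-IDF matrix, giving dense lower-dimensional vectors.
lsa = TruncatedSVD(n_components=2, random_state=42)
lsa_vectors = lsa.fit_transform(tfidf_matrix)

print(lsa_vectors.shape)   # (3 documents, 2 latent dimensions)
```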

Mate Pocs

Writing about Data Science.
