Teaching computers how to see

Adrian Pino
Feb 20, 2019 · 5 min read
Image by author

Badi, as the leading room rental marketplace, is inherently full of media content. We really mean it: users have uploaded millions of pictures since we launched three years ago. If you have a spare room, you'll upload plenty of pictures of your cool apartment to make it stand out. Moreover, you'll want your future roommates to know who you are, so your profile will include a good-looking selfie, and if you're a pet lover, you could also add a snap of your four-legged friend, describing more about your personality than just typing it in your bio.

This leaves our Data Scientists with a vast amount of unstructured data to analyze. We already use structured data from user profiles, such as age, occupation, gender, etc. to understand user behaviour on the platform, study how new and exciting features fare, craft better recommendations for our users, and so on. Our next logical step was pictures. There are many analyses we could do, and many ways we could improve our user experience, if we harness this information. What knowledge can we get out of them? Do some users only contact rooms that are full of natural light? Do they prefer to live with people who depict themselves practicing sport? Let's try to answer some of these questions, but first we'll need to translate these images into something a computer can understand. Let's extract features out of them.

A one-megapixel image, very poor quality by current camera standards, is composed of 1 million pixels, each with three color channels: red, green and blue. This means that, for a machine learning model, an image is 3 million columns! Having this vast number of features is an issue for machine learning models. We need to reduce the dimension of these images to something bearable for a computer, but that still encodes their essence. That's where Neural Networks kick in.
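To make that dimensionality concrete, here is a minimal sketch in NumPy (the 1000×1000 resolution is just the one-megapixel example above, and the blank image is a placeholder):

```python
import numpy as np

# A hypothetical 1-megapixel RGB image: 1000 x 1000 pixels, 3 color channels.
image = np.zeros((1000, 1000, 3), dtype=np.uint8)

# Flattened into a single feature vector for a classic ML model:
features = image.reshape(-1)
print(features.shape)  # (3000000,) -- 3 million columns
```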

Stylized representation of a neural network.

Luckily for us, we're standing on the shoulders of giants. In recent years we've experienced an enormous advance in Artificial Neural Networks. Roughly speaking, they are systems vaguely inspired by the human brain which are capable of learning complex patterns, where neurons are mathematical operations on the data. Whether the data are closing prices of the S&P, the content of tweets, audio files, or images is up to you.

One very active area of research in Neural Networks has been image classification. Tasks in this field include telling what's in a picture, detecting objects on the road, which is required to teach your Tesla to drive, or detecting hotdogs. Networks used for these tasks are called Convolutional Neural Networks (CNNs), because they use convolutions: mathematical operations that emulate the response of a neuron to visual stimuli, looking only at its surrounding area. Applying the convolution over the whole image means striding through it in patches.

Great animation of applying a convolution to a 6x6 image, by https://github.com/vdumoulin/conv_arithmetic.
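To see the patch-by-patch idea in code, here is a minimal sketch using SciPy on a toy 6x6 input like the one in the animation (the kernel values are just an illustrative vertical-edge detector, not anything a trained network would use):

```python
import numpy as np
from scipy.signal import convolve2d

# A toy 6x6 single-channel "image" and a 3x3 kernel.
image = np.random.rand(6, 6)
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])

# 'valid' slides the kernel only over positions where it fully fits inside
# the image, producing a 4x4 output -- the striding-in-patches idea above.
output = convolve2d(image, kernel, mode="valid")
print(output.shape)  # (4, 4)
```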

In recent years we're also experiencing an enormous democratization of AI, with many frameworks being open sourced, and already trained, state-of-the-art networks being released, ready to be used. The most time-, and probably resource-, intensive part of designing a Deep Neural Network is gathering labeled images and training it, so it's a luxury to be able to use these networks without having a cluster of computers running for days.

However, many of these networks may not be suitable for our desired task. CNNs are normally trained to detect thousands of different objects, and we may be interested in telling apart double beds from single beds, not a dog from a car. Here's where transfer learning comes in. It consists of transferring the knowledge of a neural network trained for a general task, i.e. detecting thousands of different objects, to your specific task, such as detecting a good-looking apartment. This is done by taking the outputs of intermediate layers of the neural network, called embeddings, and using them as input for your specific problem. These embeddings are an intermediate representation of the image, not very specific to the original problem, but able to encode important information about the image. The output of this process translates an image into a fixed-size vector of floating point numbers.

This representation makes very little sense to humans. It's the evaluation of a long chain of mathematical operations on that specific image, still far too high-dimensional for a human to interpret. To see the results, we encoded a few hundred thousand images of rooms and users in Badi, and clustered similar embeddings. For the neural networks, we used Keras to extract the encodings, playing around with various architectures, such as VGG16 and ResNet50, with similar results. In both cases, we removed the fully connected layers that map neurons to the outputs of the model.
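A minimal sketch of how that extraction can look in Keras, assuming VGG16 as the backbone (this is not our exact pipeline; the file path, input size and pooling choice here are illustrative):

```python
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image

# Load VGG16 pre-trained on ImageNet, dropping the fully connected
# classification layers and pooling the last convolutional block
# into a fixed-size vector.
model = VGG16(weights="imagenet", include_top=False, pooling="avg")

def embed(path):
    # Load and preprocess a single image the way VGG16 expects.
    img = image.load_img(path, target_size=(224, 224))
    x = np.expand_dims(image.img_to_array(img), axis=0)
    x = preprocess_input(x)
    # With average pooling, the embedding is a 512-dimensional vector.
    return model.predict(x)[0]

embedding = embed("room.jpg")  # hypothetical file path
print(embedding.shape)  # (512,)
```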

Once we had the embeddings of the images, we used K-Means clustering from scikit-learn:
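A minimal sketch of that step (the cluster count is a free parameter, and the random embeddings here are placeholders for the vectors produced by the extraction above):

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder embeddings: one fixed-size vector per image.
# In the real pipeline these come from the CNN extraction step above.
embeddings = np.random.rand(1000, 512)

# The number of clusters is a free parameter; 50 here is just an example.
kmeans = KMeans(n_clusters=50, random_state=42)
labels = kmeans.fit_predict(embeddings)

# labels[i] is the cluster assigned to image i; images sharing a label
# can then be plotted together, as in the examples below.
print(labels[:10])
```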

And here are some examples of images clustered together. Bear in mind that these groups were obtained in a completely unsupervised manner.

Beach scenery
Terraces
Did you visit the Coliseum? Me too!
Do you want your flatmate to be curly…
… or straight?

As you can see, clustering embeddings produces groups of images with similar scenery, facial traits, and so on. This is because intermediate layers in neural networks encode shapes and patterns, not the specific class an image belongs to.

If you love Data Science, are passionate about making an astounding product through Artificial Intelligence, and really want to revolutionize how people find their next home, Badi is the place for you. Have a look at our jobs page, or reach out to me with all your questions.
