The Power of Deep Learning: How Our Simple Mask Detector Evolved into a Multi-Faceted Surveillance System
In this article, I will show how this project was implemented in a relatively short time, the motivation behind it, and the challenges we faced along the way.
The motivation behind the project
It all began with a competition at my university (Transilvania University of Brasov, Romania). The university announced it would award 1 million euros to 10 teams that came up with novel project proposals with the potential to digitalize the university (100k euros per team to buy equipment, plus student scholarships).
Being in the midst of a global pandemic, the idea of developing an artificial-intelligence-based system related to it came very naturally. So my teammate and I started a brainstorming session. At that moment, face mask detection algorithms were very popular, and we decided to implement something similar.
But it would be way too simple for a 100k euro project, right?
And that’s right! So we decided to complicate the system a little bit by adding some more features:
- measure people’s body temperature
- measure the social distance between people
- aggregate all the collected data and display it on a web dashboard
- generate alarms (audio + visual) when people do not comply with the required measures
And the most important feature: doing all these jobs in a non-invasive way. This means that people behave just as they used to when entering a building (no guard with a thermometer gun, no scanner you have to place your hand on, etc.). This gives the system a novel feature compared to existing solutions.
The final idea of the proposal was to build multiple such systems (or devices, as I will call them from now on) and place them at the entrance of every university building.
With all this in mind, we wrote a ~30-page business proposal, and luckily it was accepted! The only difference was that the university leadership wanted to see a prototype first (a single device) and then scale it up.
How it works
To recap, the idea was to build multiple devices (each recognizing the presence/absence of a face mask, body temperature, and social distance), all sending data to a web platform. With this in mind, we designed (and implemented) the following architecture.
As can be seen in the figure above, there are 3 main components, which I will describe separately:
a. Video stream processor for device N
This is the software that runs on the device (more specifically, on the embedded board), which means that all the algorithms live here.
For the mask detection task, we simply employed a pre-trained YOLOv4-tiny from GitHub, with the idea of further fine-tuning it on our own data. Surprisingly, it worked pretty well out of the box, even in low-light conditions, so we decided to use it as-is and save our time and energy for building the other software parts.
The challenge here was to make it run in real time. Using the pure Python implementation, we got a pretty low FPS (I don’t remember the exact value, but trust me, it was slow). So what we did was run the detection algorithm (YOLOv4-tiny) directly through the Darknet framework (which is written in C and CUDA and is obviously faster than Python). And it worked! The FPS increased significantly. We invoked the Darknet binary from Python using the subprocess library and read its console output.
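To make this concrete, here is a minimal sketch of that approach; the binary path, config files, and output parsing are assumptions based on the standard Darknet CLI (`-ext_output` prints box coordinates to stdout), not our exact code.

```python
# A minimal sketch of driving Darknet from Python via subprocess.
# Paths, flags, and the exact output format are assumptions based on
# the standard Darknet CLI, not our exact setup.
import re
import subprocess

def detect(image_path):
    cmd = [
        "./darknet", "detector", "test",
        "cfg/obj.data", "cfg/yolov4-tiny.cfg", "yolov4-tiny.weights",
        image_path, "-ext_output", "-dont_show",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # Parse lines such as:
    # "mask: 97%  (left_x: 117 top_y: 73 width: 248 height: 469)"
    pattern = (r"(\w+): (\d+)%\s+\(left_x:\s*(-?\d+)\s+top_y:\s*(-?\d+)"
               r"\s+width:\s*(\d+)\s+height:\s*(\d+)\)")
    detections = []
    for label, conf, x, y, w, h in re.findall(pattern, result.stdout):
        detections.append({"label": label, "conf": int(conf),
                           "box": (int(x), int(y), int(w), int(h))})
    return detections
```

In the real system the detector runs on a video stream rather than single images, but the same parse-the-console-output trick applies.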
The high-temperature detection task was pretty straightforward to implement. We took the stream from the thermal camera, extracted the temperature (based on pixel intensities), and compared it with a threshold. However, we still encountered some challenges with this task:
▹We could not find a good API for the thermal camera that would give us both the video stream and real temperatures. There are some APIs available, but they do not work properly. So we had to extract the frames manually and build a mapping between the intensities of the red channel and the actual temperatures (a sketch of this mapping is shown after this list).
▹We had to launch this as a separate process and synchronize it with the one responsible for mask detection (which was a Python script that calls a C executable).
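For illustration, here is a minimal sketch of the intensity-to-temperature mapping; the temperature range and fever threshold are hypothetical values that, in practice, come from the camera’s configured display range.

```python
# A minimal sketch of mapping red-channel intensity to temperature.
# TEMP_MIN/TEMP_MAX are hypothetical: in practice they come from the
# temperature range the thermal camera is configured to display.
import numpy as np

TEMP_MIN, TEMP_MAX = 30.0, 45.0   # degrees Celsius (assumed camera range)
FEVER_THRESHOLD = 37.5            # alarm threshold (assumption)

def max_temperature(frame_bgr, face_box):
    """Estimate the hottest point inside a face bounding box.

    frame_bgr: thermal frame as an OpenCV-style BGR array.
    face_box:  (x, y, w, h) in pixels.
    """
    x, y, w, h = face_box
    red = frame_bgr[y:y + h, x:x + w, 2].astype(np.float32)
    # Linear mapping: intensity 0 -> TEMP_MIN, intensity 255 -> TEMP_MAX.
    temps = TEMP_MIN + (red / 255.0) * (TEMP_MAX - TEMP_MIN)
    return float(temps.max())

# Usage: if max_temperature(frame, box) > FEVER_THRESHOLD, raise an alarm.
```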
The social distance measurement feature may, at first glance, seem extremely easy to implement. You just take the centers of the detected face bounding boxes, compute the distance between them in pixels, and then map it to physical units (centimeters). Well, this works like a charm only if the detected people are always in the same Z plane; in other words, always at the same fixed distance from the camera. Since we had promised a robust and non-invasive system, this constraint was not acceptable.
To approximate the real distance, we have to compute the Euclidean distance in 3D space. The problem is that we were not using a depth camera, so we only had information about the X and Y coordinates of the faces (the 2D plane), and not Z.
Now comes the question: how can we estimate the depth of faces from 2D images alone?
Our idea was to use the area of the bounding boxes, starting from the assumption that all faces are approximately the same physical size. This means that if two bounding boxes (corresponding to two detected faces) have the same area, those two people are in the same Z plane (same depth). If a bounding box is smaller, the detected person is further away from the camera; and vice versa, the larger the bounding box, the closer that person is. With this in mind, we employed a small neural network to approximate the real distance between people, using the X and Y coordinates from the 2D frame and the area of the bounding box as the Z coordinate.
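Here is a minimal sketch of the feature construction, assuming a box format of (x_min, y_min, x_max, y_max); the regression model itself is only hinted at in the comments.

```python
# A minimal sketch of the feature construction for the distance model.
# The regressor is a placeholder: in our case a small neural network
# was trained to map such features to real distances in centimeters.
import numpy as np

def face_features(box):
    """box = (x_min, y_min, x_max, y_max) in pixels.

    Returns (cx, cy, area): the 2D center plus the bounding-box area,
    which acts as a proxy for the Z coordinate (assuming all faces have
    roughly the same physical size, a smaller box means a more distant face).
    """
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    area = (box[2] - box[0]) * (box[3] - box[1])
    return np.array([cx, cy, area], dtype=np.float32)

# Hypothetical usage: `model` is a small regression network (e.g. a
# two-layer MLP) trained on pairs of such features labeled with
# measured distances between the two people, in centimeters.
# pair = np.concatenate([face_features(box_a), face_features(box_b)])
# distance_cm = model.predict(pair[None, :])[0]
```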
All three algorithms send their results to the middle man module as JSON events, via POST requests, for every single frame.
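A minimal sketch of the device-side reporting might look like this; the endpoint URL and event schema are illustrative, not our exact API.

```python
# A minimal sketch of pushing per-frame events to the middle man.
# The URL and event fields are hypothetical placeholders.
import requests

MIDDLEMAN_URL = "http://middleman.local:5000/events"  # hypothetical address

def send_event(device_id, event_type, payload):
    """Push one per-frame event (mask / temperature / distance) as JSON."""
    event = {"device": device_id, "type": event_type, **payload}
    try:
        requests.post(MIDDLEMAN_URL, json=event, timeout=1)
    except requests.RequestException:
        pass  # don't let a network hiccup stall the video loop

# Usage:
# send_event("entrance-1", "no_mask", {"cx": 320, "cy": 140})
```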
b. Middle man
This module has the following roles, described in order:
- Receive all the events from the device/devices.
- Process them:
- Decide whether a specific event (for example, a person detected without a mask) belongs to the same person as one from the previous frame. This could be done very nicely with a tracking algorithm, but since that is computationally expensive, we did it in a more manual way, by looking at coordinates and timestamps (see the sketch after this list)
- Convert the JSON events into our specific data types (basically, statistical Python objects that I will describe in the web platform section)
- Store them in a database
- Expose all the collected data to the web platform through REST services
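Here is a minimal sketch of that coordinate-and-time heuristic; both thresholds are illustrative assumptions, and events are assumed to carry a center point and a datetime timestamp.

```python
# A minimal sketch of the "manual tracking" heuristic we used instead
# of a full tracker; thresholds are illustrative assumptions.
from datetime import timedelta

MAX_SHIFT_PX = 50               # max center movement between frames (assumption)
MAX_GAP = timedelta(seconds=2)  # max time gap between events (assumption)

def same_person(prev_event, new_event):
    """Treat two events as the same person if they are close in space and time."""
    dx = new_event["cx"] - prev_event["cx"]
    dy = new_event["cy"] - prev_event["cy"]
    close_in_space = (dx * dx + dy * dy) ** 0.5 < MAX_SHIFT_PX
    close_in_time = new_event["ts"] - prev_event["ts"] < MAX_GAP
    return close_in_space and close_in_time
```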
So, nothing crazy here: just a module written in Python with Flask that receives data from the device and makes it available, in processed form, to the web platform. This module should not be hosted on the device itself, but on a separate server, together with the web platform.
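As a rough illustration, a stripped-down version of such a middle man could look like this; the routes and the in-memory list are simplified stand-ins for our actual endpoints and database.

```python
# A minimal sketch of the middle man with Flask. Routes and storage are
# simplified stand-ins (we used a real database, not an in-memory list).
from flask import Flask, jsonify, request

app = Flask(__name__)
events = []  # stand-in for the database layer

@app.route("/events", methods=["POST"])
def receive_event():
    # Receive one JSON event from a device.
    events.append(request.get_json())
    return "", 204

@app.route("/stats/daily", methods=["GET"])
def daily_stats():
    # Expose processed data for the dashboard (aggregation omitted here).
    return jsonify({"total_events": len(events)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```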
c. Web platform
Finally, the web platform. This is basically a dashboard that displays the following types of statistics:
- For the current day, the total number of people who entered a building, divided into 4 categories:
- those who wear a mask
- those who don’t wear a mask (or wear it incorrectly)
- those with a high body temperature
- those who don’t respect the social distance
- For the current day, the total number of people who comply with the Covid-19 restrictions, in contrast to those who don’t (this time in percentages)
- Cumulative traffic of people entering a building during the current day (again, divided into two categories: those who comply with the restrictions and those who don’t)
- For every day of the week, the total number of people who comply with the restrictions, in contrast with those who don’t
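To make the aggregation concrete, here is a minimal sketch of computing such a daily summary from stored events; the event type names are illustrative.

```python
# A minimal sketch of the daily compliance split; field names are
# illustrative, not our actual schema.
from collections import Counter

def daily_summary(events):
    """events: list of dicts with a 'type' field, already filtered to today."""
    counts = Counter(e["type"] for e in events)
    violations = counts["no_mask"] + counts["high_temp"] + counts["no_distance"]
    total = sum(counts.values())
    compliant = total - violations
    return {
        "per_category": dict(counts),
        "compliant_pct": 100.0 * compliant / total if total else 0.0,
    }
```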
We believe that with these statistics you can get a high-level overview of when, where, and why the risk of spreading the Covid-19 virus may increase, and obviously, what can be done to prevent it.
This module was written entirely in React with the Bootstrap framework, starting from a free template found on this amazing website. Again, we were not trying to reinvent the wheel, but to deliver an MVP as fast as possible.
I believe the whole software architecture described above could be done in a more elegant/optimized way, but our expertise here is limited since we are both data scientists 😄.
Hardware components
The heart and brain of the device is a Jetson Xavier embedded computer. The eyes are a bi-spectrum Hikvision camera (thermal + RGB). These two are the main components.
Among the secondary components, we have a wireless router for accessing the camera, a 21-inch monitor as the device’s display, a 4-socket extension cord, and all the related power cables.
All these components are attached to a black wooden board, which is mounted on a metal tripod.
The total cost of this prototype is around 4k euros, which means it can easily be scaled to every building of the university within the overall budget. This cost can also be optimized further.
Another big challenge of this project was that the equipment was delivered only ~7 months after the project started (out of a 10-month timeline from start to delivery). It turns out that buying equipment through auctions is a very slow process from a bureaucratic point of view.
Future work
In the end, I think this project looks pretty fancy, but to be completely honest, it doesn’t add that much value. It warns you if the restrictions are not respected, but it doesn’t stop you in any way. One of the initial ideas was to save screenshots of offenders, which could be used to sanction them later, but due to GDPR rules we couldn’t do that.
One cool thing that could be done with the assembled device instead is an attention detector for in-person presentations. Imagine placing this device in a classroom. During the course, it could analyze when students lose interest in the lesson by looking at their faces and analyzing their emotions. This could help teachers adjust the boring parts of a lesson, decide when to give additional breaks, and more.
Conclusion
It was a fun project to do, and I definitely learned a lot of new things working on it. Even if you have plenty of pre-trained models to choose from and think it will be an easy project, there is still a long way to go toward productization and robustness.
Also, I hope this inspires you to participate in such competitions. There is a lot you can learn while earning some decent money, and you get the chance to play with some very expensive equipment.