Bringing AI to video and images

Using AI for managing images and videos at scale

Nadav Soferman

AI is finding a wide variety of uses for images and video, including auto-cropping and resizing, automating image tagging, and creating video auto-previews. This article examines five examples of how AI is used to enable image and video management at scale at Cloudinary.

Managing the lifecycle of images and videos at scale has become a huge challenge for developers. Maintaining sites with thousands of media assets and tonnes of user-generated content is impossible without either armies of designers and developers or automation.

I know this from personal experience. Originally a team of consultants, my Cloudinary co-founders and I were repeatedly solving image- and video-management requests manually. We saw a huge opportunity to automate these processes – and Cloudinary was born.

Here are five examples of how we apply AI across our platform to enable image and video management at scale.


Image auto-cropping and resizing

Cropping images accurately, quickly and at scale is challenging given the huge number of devices and browsers out there. It requires delivering the same image in many aspect ratios, and potentially cropping closer to or wider around the main subject, depending on its size.

When cropping, the most important parts of an image should remain visible, even central. That is easier said than done. Recently, a fashion retailer unintentionally cropped out the shoes it was promoting because the model's feet weren't in the centre of the frame. Another approach is to crop an image based on mathematical pixel analysis, focusing on the regions where the pixels are sharper than elsewhere. Though this method is powerful, it often isn't enough.

To get auto-cropping right, you need to look at an image as a human eye would. We use deep learning-based media transformations for visual content to detect the subjects in an image that are most likely to capture a person's attention. To do this, we feed the auto-cropping deep learning model tonnes of images along with corresponding human input. This teaches the machines to identify the important regions in images, regardless of their subject and layout. The process involves advanced computations performed by GPU-based hardware clusters that process millions of crop requests on the fly.
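Once a model has produced a per-pixel "importance" (saliency) map, choosing the crop itself is a classic sliding-window search. Here is a minimal, self-contained sketch of that last step; the saliency map is a plain 2-D list of floats standing in for real model output, and the function name is my own, not Cloudinary's:

```python
def best_crop(saliency, crop_h, crop_w):
    """Return (top, left) of the crop_h x crop_w window covering the most saliency.

    Uses an integral image so each candidate window is scored in O(1).
    """
    h, w = len(saliency), len(saliency[0])
    # sums[y][x] = total saliency of the rectangle [0, y) x [0, x)
    sums = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            sums[y + 1][x + 1] = (saliency[y][x] + sums[y][x + 1]
                                  + sums[y + 1][x] - sums[y][x])
    best, best_pos = float("-inf"), (0, 0)
    for top in range(h - crop_h + 1):
        for left in range(w - crop_w + 1):
            total = (sums[top + crop_h][left + crop_w]
                     - sums[top][left + crop_w]
                     - sums[top + crop_h][left]
                     + sums[top][left])
            if total > best:
                best, best_pos = total, (top, left)
    return best_pos

# Toy 4x6 map whose "subject" sits in the upper-right region:
toy = [[0, 0, 0, 1, 2, 1],
       [0, 0, 0, 2, 5, 2],
       [0, 0, 0, 1, 2, 1],
       [0, 0, 0, 0, 0, 0]]
print(best_crop(toy, 3, 3))  # -> (0, 3)
```

A production system adds many refinements on top of this (aspect-ratio-aware windows, zoom levels, object-preservation constraints), but the core idea of maximising captured saliency is the same.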

But deep learning can even do more. Take the example of the retailer that cropped out the shoes it wanted to sell. Thanks to another deep learning algorithm we can give the highest preservation (“don’t crop”) priority to specific objects or categories, like shoes.

Video auto-cropping and resizing

Getting video displayed correctly is a big challenge, because many videos are shot in a horizontal aspect ratio even though they are consumed vertically on mobile devices. Mobile viewers would need to flip their devices to watch these videos properly, but that rarely happens. Some sites work around the format problem by manually adding black bars, or a blurred copy of the video behind it. This usually doesn't look good and hurts the user experience. And it's not only mobile devices that cause format issues: a huge number of videos are consumed on social channels like Instagram that display videos in a square format.

To crop videos well you need to ensure that the most important elements remain visible in every format, for each scene. For example, if you display a football game video, you probably want to ensure that the ball is in the centre of any frame. As with images, we use deep learning algorithms to analyse the frames from the video and identify the areas that are most interesting to the human eye. From this analysis, a heat map is produced and then used to intelligently crop the video. The cropped video follows the most interesting area throughout its duration, ensuring that the video retains all the important features while filling the screen regardless of the aspect ratio.
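A detail worth noting: if the crop window simply jumped to the hottest spot of each frame's heat map, the result would jitter. One common remedy is to smooth the per-frame crop centres over time, for example with a moving average. This is an illustrative sketch of that idea, not Cloudinary's actual implementation; the function name and window size are assumptions:

```python
def smooth_centres(centres, window=5):
    """Moving-average smoothing of per-frame crop centres (x-coordinates).

    Each output value averages up to `window` neighbouring frames, so the
    crop follows the subject without twitching frame to frame.
    """
    half = window // 2
    out = []
    for i in range(len(centres)):
        lo, hi = max(0, i - half), min(len(centres), i + half + 1)
        out.append(sum(centres[lo:hi]) / (hi - lo))
    return out

# A one-frame spike in "interest" gets spread out instead of yanking the crop:
print(smooth_centres([0, 0, 10, 0, 0], window=5))
```

Real systems typically go further (scene-cut detection so smoothing doesn't blur across cuts, velocity limits on the window), but temporal smoothing is the essential ingredient.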

Video previews

If your site shows many video thumbnails, loading and playing them all at once gets messy and hurts performance. Video previews, which show a few seconds of footage to spark the visitor's interest, are therefore becoming more popular. Creating a good preview is an art in its own right. If there are only a few videos, your designers can edit the previews manually, but with hundreds or thousands you need automation.

As with cropping, we use similar deep learning algorithms to determine which segments in the original video would appeal to humans. Then we create a graph of the relevant parts and select the most interesting ones that fit the number of seconds you want to preview.
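The selection step described above can be sketched as a search over scored segments: given each segment's duration and an interest score from the model, pick the contiguous run that maximises total interest while fitting the preview's time budget. This is a simplified stand-in for the real graph-based selection, with made-up names and data:

```python
def pick_preview(segments, budget):
    """segments: list of (duration_sec, interest_score) per detected segment.

    Return (start, end) indices of the contiguous run of segments with the
    highest total interest whose total duration fits within `budget` seconds.
    """
    best, best_span = -1.0, (0, 0)
    n = len(segments)
    for i in range(n):
        dur = score = 0.0
        for j in range(i, n):
            d, s = segments[j]
            if dur + d > budget:
                break
            dur += d
            score += s
            if score > best:
                best, best_span = score, (i, j + 1)
    return best_span

# Four 2-second segments, scored by "interest"; a 4-second preview budget
# picks the two most interesting adjacent segments:
print(pick_preview([(2, 0.1), (2, 0.9), (2, 0.8), (2, 0.2)], budget=4))  # -> (1, 3)
```

Note that a preview need not be contiguous in practice; stitching the top-scoring non-adjacent segments is a natural extension of the same scoring idea.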

Categorising and tagging image and video content

If your site has a lot of user-generated content, you might want to understand what is inside the videos and images so you can match them with the right audiences, make them searchable and increase engagement. However, manually categorising and tagging large volumes of images would take up too much time and too many resources. AI content-recognition tagging is a great way to add intelligence and categorise assets.

Fortunately, companies like Google, Amazon, Microsoft, and others are offering AI-based auto-tagging. When you upload or update an image to our platform, you can request automatic categorisation from multiple engines. What you receive are the categories identified by each of the engines and their confidence scores.
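When several engines return overlapping category lists with confidence scores, a consumer of those results typically thresholds and merges them. Here is a hedged sketch of one such merge policy (keep tags above a confidence floor; where engines agree, keep the highest score); the structure of the input dict and the threshold value are my assumptions, not any vendor's API:

```python
def merge_tags(engine_results, min_confidence=0.6):
    """engine_results: {engine_name: [(tag, confidence), ...]}.

    Drop low-confidence tags, deduplicate across engines keeping the best
    score per tag, and return tags sorted by confidence, highest first.
    """
    merged = {}
    for tags in engine_results.values():
        for tag, conf in tags:
            if conf >= min_confidence:
                merged[tag] = max(conf, merged.get(tag, 0.0))
    return sorted(merged.items(), key=lambda kv: -kv[1])

results = {
    "engine_a": [("cat", 0.9), ("sofa", 0.4)],   # "sofa" falls below the floor
    "engine_b": [("cat", 0.95), ("dog", 0.7)],
}
print(merge_tags(results))  # -> [('cat', 0.95), ('dog', 0.7)]
```

Other policies (averaging scores, requiring agreement between two engines) are equally reasonable; which one fits depends on how much you trust each engine.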


Background removal

Another use case for AI is removing image backgrounds. These days, e-commerce websites usually include high-quality product photos on clean and sleek backgrounds. To achieve this, the original background first needs to be made transparent. At scale, manually editing the images is too slow and cumbersome.

We remove backgrounds automatically, using deep learning to recognise the main objects in the image, segment them, and decide which pixels belong to the foreground (and are kept) versus the background (and are removed). All of this depends on the context and composition of the scene. For production-level results, the segmentation map of foreground versus background pixels must be near-perfect to make the removal seamless.
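Mechanically, the final step is simple once the segmentation map exists: each background pixel gets a fully transparent alpha value. This toy sketch shows that step on plain nested lists (real pipelines operate on image buffers, and the hard part is producing the mask, not applying it):

```python
def apply_mask(pixels, mask):
    """pixels: H x W list of (r, g, b) tuples; mask: H x W list of 0/1 flags
    where 1 marks a foreground pixel.

    Return RGBA pixels with the background made fully transparent.
    """
    return [[(r, g, b, 255 if mask[y][x] else 0)
             for x, (r, g, b) in enumerate(row)]
            for y, row in enumerate(pixels)]

# One row, two pixels: the second is classified as background and disappears.
print(apply_mask([[(1, 2, 3), (4, 5, 6)]], [[1, 0]]))
```

The "near-perfect" requirement in the text lives entirely in the mask: a single band of misclassified pixels along the subject's edge produces a visible halo, which is why production systems refine mask edges (for example with soft alpha values) rather than using a hard 0/1 cut.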

These are just five examples of how we use AI on our platform. However, there are many more opportunities. It’s an exciting field and I can’t wait to see where the new AI-based media management automation will take us.


Nadav Soferman

Nadav Soferman is co-founder and CPO of Cloudinary, a provider of leading cloud-based image and video management solutions. A software developer at his core, he has worked at various Internet startups for the past 17 years, developing web & mobile software and managing successful development teams.
