Advances in Visual AI From Edge to Cloud

Artificial intelligence-powered solutions take video analytics to new levels

Although artificial intelligence (AI) has been the hottest technology trend of the past few years, it is anything but new. In the last quarter of the 20th century, computer vision technology, or “visual AI,” emerged with the aim of emulating human perception of the world.
In the early days, visual AI was often expensive and oversold, as the bulk of the intelligence relied on a handful of meticulously designed image processing filters, at times coupled with simple machine learning classifiers.
However, all of this has changed over the past decade with the transition to “deep learning” convolutional neural networks (CNNs). Because training on large datasets yields significantly better accuracy, these networks have replaced legacy computer vision algorithms for detection, classification and segmentation in virtually all practical industrial applications.
Recent advances in generative AI (GenAI) have brought the technology closer to artificial general intelligence and have already made a profound impact on the capabilities of computer vision. For example, vision transformers (ViTs) and multimodal transformers (such as CLIP) are direct adaptations, for visual AI use cases, of the transformer architecture that underpins modern large language models.
Visual AI for the Security Industry
While AI’s impact in many other industries is still fairly nascent, visual AI has already been widely adopted within physical security in the form of video analytics, and it is delivering measurable value today.
Not only do visual AI solutions make humans better at the work they already do (for instance, by improving the accuracy of intrusion detection), but they also act as force multipliers, letting organizations expand the scope of what their teams cover while reducing related costs.
In addition to security use cases, organizations can now leverage their video surveillance infrastructure for a wide variety of non-security, operational use cases, driving greater efficiency or generating new insights that improve business results. Examples include monitoring worker productivity and safety compliance, occupancy management, and customer service.
The emergence of GenAI opens a new frontier for AI’s impact on physical security. Compared to CNNs, the key advantages of a GenAI-based approach are deep contextual understanding and the ability to perform a wide range of tasks without specific training.
Whether it is video summarization to gain insights into a vast quantity of recorded video, smart search to aid real-time investigations, or the ability to create complex, multi-stage rules for video analytics, GenAI innovation is just getting started in security.
GenAI technology, however, brings new challenges as well, specifically around video and image authenticity for physical security and law enforcement. Leading industry organizations like the Coalition for Content Provenance and Authenticity (C2PA) and ONVIF are currently developing standards and solutions to address these important and evolving issues.
GenAI Architectures
Recent advances in visual AI will affect not only the physical security industry, but a wide range of others, including transportation, retail, education and more. Two key visual AI architectures, based on GenAI, will fuel this transformation:
- Interactive analytics for stored video
- GenAI augmentation of CNN-based video analytic pipelines
The first architecture serves emerging use cases that operate on stored video instead of real-time streaming, such as video summarization, video question and answer, and video retrieval by video/image.
- Breaking Down the Video: The input video is divided into smaller chunks to make it easier for AI models to process. These chunks can be of different lengths to capture events that happen over various timescales.
- Processing the Video: The video chunks are decoded into raw frames, which are then sampled (picked at certain intervals) and fed into video/audio encoders. These encoders, which may be CNN- or ViT-based, convert the sampled content into compact numerical representations.
- Storing and Using Data: The encoded data (embeddings) are stored in specialized vector and graph databases. Retrieval augmented generation (RAG), which lets users ask natural-language questions about the original video, runs against these databases.
- Interacting With the Video: Use cases like video question and answer, video summarization, and video retrieval by video/image can be implemented on top of this architecture. User interaction happens in the front-end user interface, which issues queries about the video via RAG, as in the sketch following this list.
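To make this flow concrete, below is a minimal Python sketch of the chunk, encode, store, and retrieve steps. It is illustrative rather than production code: it assumes OpenCV for decoding and the sentence-transformers CLIP wrapper for embeddings, and it stands in a plain NumPy array for the vector database; the model name, chunk length, and sampling rate are all assumptions.

```python
import cv2
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

# Illustrative model choice: a CLIP variant that embeds images and
# text into the same vector space.
model = SentenceTransformer("clip-ViT-B-32")

def chunk_embeddings(path, chunk_seconds=10, samples_per_chunk=4):
    """Decode a video, sample a few frames per fixed-length chunk,
    and mean-pool their CLIP embeddings into one vector per chunk."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    frames_per_chunk = int(fps * chunk_seconds)
    sample_step = max(frames_per_chunk // samples_per_chunk, 1)
    vectors, frames, i = [], [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % sample_step == 0:  # sample frames at regular intervals
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        i += 1
        if i % frames_per_chunk == 0 and frames:  # chunk boundary reached
            vectors.append(model.encode(frames).mean(axis=0))
            frames = []
    cap.release()
    if frames:  # flush the final partial chunk
        vectors.append(model.encode(frames).mean(axis=0))
    return np.stack(vectors)  # shape: (num_chunks, embedding_dim)

def search(index, query, top_k=3):
    """Rank chunks by cosine similarity to a natural-language query."""
    q = model.encode(query)
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    return np.argsort(-sims)[:top_k]  # best-matching chunk indices first

# Hypothetical usage:
# index = chunk_embeddings("lobby_cam.mp4")
# print(search(index, "a person carrying a red suitcase"))
```

In a full implementation, the chunk vectors would live in a vector database and a language model would compose retrieved chunks into summaries or answers (the generation half of RAG); this sketch covers only retrieval.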
GenAI models also have the potential to augment state-of-the-art video analytics pipelines to either enable self-learning capabilities or offer complex multi-stage rules for analytics that are otherwise not feasible with CNN-based models.
Rules-based filtering has become a common feature in video management systems (VMS) for quickly finding specific objects or events within thousands of hours of video across hundreds of cameras. Simple built-in filters such as object type, color and license plate number are common. Some offerings allow users to build custom search filters, but doing so involves either retraining the underlying CNN models or writing code, neither of which is achievable by the end customer without help from a software developer or integrator. Alternatively, solutions built on multimodal GenAI models like CLIP can offer natural-language search functionality without the need for model retraining.
For example, in response to a crime wave committed by thieves on motorcycles, law enforcement may want to identify all sightings of motorcycles where the rider is wearing a backpack. Training a CNN model for this specific use case would be quite difficult (due to the lack of sufficient training data), but this complex multi-stage filtering could be achieved by introducing a GenAI model like CLIP into the post-processing stage of a typical object detector pipeline, as sketched below.
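As a hedged sketch of that post-processing stage, the snippet below scores detector crops against competing natural-language prompts with CLIP. The clip-ViT-B-32 model, the prompt wording, and the detector_crops input are illustrative assumptions; the crops are presumed to come from an upstream CNN motorcycle detector.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# Illustrative CLIP variant; images and text share one embedding space.
model = SentenceTransformer("clip-ViT-B-32")

# Competing text prompts; each crop is assigned to whichever scores higher.
PROMPTS = ["a motorcycle rider wearing a backpack",
           "a motorcycle rider without a backpack"]
prompt_embeddings = model.encode(PROMPTS)

def rider_has_backpack(crop: Image.Image) -> bool:
    """Zero-shot check: does this detector crop match the backpack prompt?"""
    scores = util.cos_sim(model.encode(crop), prompt_embeddings)[0]
    return int(scores.argmax()) == 0

# Hypothetical usage, given crops from an upstream motorcycle detector:
# matches = [c for c in detector_crops if rider_has_backpack(c)]
```

No retraining is involved: changing the search criterion is simply a matter of editing the prompts, which is what makes this approach accessible to end customers.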
GenAI models could also be used to create a reinforcing feedback loop that increases the accuracy of CNN models via self-learning. Vision GenAI models are often more accurate and versatile than their CNN counterparts but usually require more computing power. Hence, GenAI models could be run in a feedback loop outside of the main streaming pipeline to periodically validate the accuracy of the CNN models, with detected errors used to retrain the CNN models offline, as in the sketch below.
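A minimal sketch of such a loop follows; cnn_detect and genai_validate are hypothetical stand-ins for the deployed CNN detector and the heavier GenAI validator, and the disagreement threshold is an arbitrary illustrative value.

```python
import random

# Hypothetical stand-ins: in a real system these would wrap the deployed
# CNN detector and a heavier GenAI model run offline as the validator.
def cnn_detect(frame):
    return {"motorcycle"} if random.random() > 0.3 else set()

def genai_validate(frame):
    return {"motorcycle"}

def validation_pass(sampled_frames, disagreement_budget=0.05):
    """Compare CNN output to the GenAI validator on frames sampled
    outside the main streaming pipeline; collect disagreements as
    candidate retraining data."""
    retrain_queue = []
    for frame in sampled_frames:
        if cnn_detect(frame) != genai_validate(frame):
            retrain_queue.append(frame)  # label later via the GenAI output
    rate = len(retrain_queue) / max(len(sampled_frames), 1)
    if rate > disagreement_budget:
        print(f"Disagreement rate {rate:.1%}: schedule offline retraining")
    return retrain_queue

# Hypothetical usage on frames periodically sampled from recorded video:
# queue = validation_pass(sampled_frames)
```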
Smart Camera and Cloud AI Implementations
There are various approaches to implementing visual AI solutions, the selection of which can depend on, among other things, budget, use case, available bandwidth, data sovereignty, existing infrastructure, and local regulations.
AI is now an option inside the security camera (aka “smart camera”), where analytics can run at the source of the video, providing the lowest possible latency. Whether cameras stream to on-premises video management infrastructure or direct to a cloud video service, the simplicity of AI camera architecture can be appealing. However, it is not always possible to run multiple video analytics on the same video stream (particularly for a heavy AI workload), and the set of available analytics may be limited and can vary by camera manufacturer.
At the other end of the spectrum, cloud-based analytics are increasingly available as part of cloud video or video surveillance as a service (VSaaS) agreements and are easy to manage, given the cloud’s scalability. While latency and bandwidth can be a concern for some use cases or locations, cloud-based analytics are particularly well suited to use cases involving data aggregation, like heatmapping and other trend analyses. In some cases, data privacy policies or local regulations may limit cloud-based adoption.
Edge AI System Architecture
Dedicated edge AI systems or servers, ranging in size from small form factor to rackmount servers, can be deployed on-premises to run one or multiple video analytics across aggregated video streams. These typically take the form of edge AI appliances or servers, “smart NVRs,” or similar AI-enabled video infrastructure. This type of system architecture often provides the most options and flexibility for implementing video analytics (including the ability to run multiple heavy AI analytics against the same camera streams) and can be cost effective because it centralizes compute that then processes all of the aggregated video streams.
For near real-time use cases, which are common in the security industry, low latency is critical. Adding edge compute to VSaaS offerings delivers that low latency along with bandwidth savings, since only metadata and related images or video clips are uploaded to the cloud.
The primary downside of the edge AI server approach is the additional infrastructure to manage. However, AI workloads that used to require multiple servers can now be handled in a single box, thanks to steady gains in general computing power, AI and media acceleration capabilities, and the growing range of AI-enabled silicon options.
In addition to graphics processing units (GPUs), which have traditionally been synonymous with AI inference, there are now more options for edge AI processing than ever, including technologies built into CPUs for AI acceleration like neural processing units (NPUs) and integrated GPUs. This diversity of compute options at the edge results in more choice in device form factors and the possibility of lower price-per-stream for state-of-the-art video analytics and GenAI solutions.
Best Practices in Sizing Edge AI Infrastructure
Compared to video-only solutions like VMS servers or NVRs, determining the required size (in computing power) of visual AI edge servers is more complex and difficult, but no less necessary. By sizing compute accurately, system integrators, in particular, can more easily win competitive project bids while ensuring the system will meet the performance expectations of the end customer.
Sizing for visual AI solutions begins with end-to-end benchmarking: measuring the performance of the entire visual AI pipeline, both media processing and AI inference. Depending on the use case, input video, and AI models, media processing, rather than AI inference, can be the performance bottleneck. Estimating AI system performance from specs like TOPS (trillions of operations per second) will not help in this case, since TOPS is a theoretical measure that applies only to the AI inference part of the pipeline, not media processing.
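The sketch below illustrates the idea, timing decode and inference separately so the true bottleneck is visible; run_inference is a hypothetical stand-in for whatever models the pipeline runs, and the file path and frame budget are illustrative.

```python
import time
import cv2

def run_inference(frame):
    """Hypothetical stand-in for the AI model(s) in the pipeline."""
    time.sleep(0.004)  # simulate roughly 4 ms of inference per frame

def benchmark(path, max_frames=500):
    """Time media decode and AI inference separately so the end-to-end
    bottleneck is visible, rather than inferred from TOPS specs."""
    cap = cv2.VideoCapture(path)
    decode_s = infer_s = 0.0
    frames = 0
    while frames < max_frames:
        t0 = time.perf_counter()
        ok, frame = cap.read()              # media processing (decode)
        if not ok:
            break
        t1 = time.perf_counter()
        run_inference(frame)                # AI inference
        t2 = time.perf_counter()
        decode_s += t1 - t0
        infer_s += t2 - t1
        frames += 1
    cap.release()
    if frames:
        total = decode_s + infer_s
        print(f"{frames} frames: decode {decode_s:.2f}s, "
              f"inference {infer_s:.2f}s, {frames / total:.1f} FPS end-to-end")

# Hypothetical usage with footage from the project site:
# benchmark("site_footage.mp4")
```

Measured this way, a per-box stream estimate falls out directly: achievable streams are roughly the end-to-end FPS divided by the per-stream frame rate the use case requires.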
The end-to-end approach applies not just to CNN-based video analytics but also to various GenAI use cases with video. Just as CNN AI models can range from quite small to very large, GenAI models also span a wide range, from millions to billions of parameters.
For the most accurate visual AI infrastructure sizing, nothing beats a proof of concept (POC) with actual video from the project site. The scene complexity (including number and speed of objects present) and video settings (resolution, frames per second, bitrate/compression) both materially affect visual AI performance.
Since POCs are often not possible or practical, an alternative approach is to leverage tools that help create and benchmark end-to-end video and AI pipelines. With project-specific visual AI system performance estimates in hand, system integrators and solution builders can more easily understand which AI systems deliver sufficient performance and make more informed purchasing decisions while de-risking deployment.
Next Steps With Visual AI
Solution builders can explore multiple AI system architectures and compute options through a total cost of ownership lens, based on the end-to-end, real-world workload. Consider tools that not only help develop solutions more quickly and easily, but also assist with end-to-end benchmarking and system sizing to reduce price-per-stream for end customers.
System integrators can work with their technology partners to understand new use cases enabled by GenAI and explore all manner of system architectures, with reasonably accurate sizing estimates, to create value for their customers. Integrators may be surprised at how many video streams new visual AI infrastructures can handle.
End users have more options than ever. By working with trusted integrator and technology partners, they can better understand how investments in visual AI can not only improve security outcomes but also enhance overall operations.