Two years ago, Microsoft unveiled Florence, an AI system it called a “complete rethinking” of current computer vision models. Florence was both “unified” and “multimodal,” meaning it could understand both language and images and handle a variety of tasks rather than being restricted to particular applications, like creating captions. That set Florence apart from most vision models at the time.
Florence is now available as part of an update to the Vision APIs in Azure Cognitive Services, the latest step in Microsoft’s larger, ongoing effort to commercialize its AI research. The Florence-powered Microsoft Vision Services, with features like automatic captioning, background removal, video summarization, and image retrieval, go live today in preview for current Azure customers.
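To make the captioning feature concrete, here is a minimal sketch of how a client might call the Florence-powered Image Analysis endpoint. The endpoint shape, `api-version` string, resource name, and key are assumptions for illustration, not confirmed details from this article; check the Azure documentation for the current values.

```python
# Hypothetical sketch of a caption request against the Azure Image Analysis
# API (Florence-backed). URL shape and api-version are assumptions.

def build_caption_request(endpoint: str, api_key: str, image_url: str) -> dict:
    """Assemble the URL, query params, headers, and body for a 'caption' call."""
    return {
        "url": f"{endpoint}/computervision/imageanalysis:analyze",
        "params": {
            "api-version": "2023-02-01-preview",  # preview version (assumption)
            "features": "caption",                # request only a caption
        },
        "headers": {
            "Ocp-Apim-Subscription-Key": api_key,
            "Content-Type": "application/json",
        },
        "json": {"url": image_url},               # image passed by URL
    }

req = build_caption_request(
    "https://my-resource.cognitiveservices.azure.com",  # placeholder resource
    "YOUR_KEY",                                         # placeholder key
    "https://example.com/photo.jpg",
)
# With the `requests` library you would then POST:
# requests.post(req["url"], params=req["params"],
#               headers=req["headers"], json=req["json"])
print(req["url"])
```

The same request with `features=backgroundRemoval` or `features=tags` would exercise the other preview capabilities mentioned above.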
“Billions of image-text pairs were used to train Florence. Consequently, it is incredibly versatile,” John Montgomery, CVP of Azure AI, told TechCrunch via email. Ask Florence to locate a specific frame in a video, and it will be able to do so. Ask it to distinguish between a Cosmic Crisp and a Honeycrisp apple, and it can do that too.
The AI research community, tech behemoths like Microsoft included, views multimodal models as the best route to more powerful AI systems. Because they comprehend multiple modalities, such as language and images or video and audio, multimodal models naturally handle tasks like captioning videos that unimodal models cannot tackle as readily.
Why not combine several “unimodal” models to accomplish the same goal, such as a model that comprehends only images and another that comprehends only language? There are a few reasons. The first is that multimodal models sometimes outperform their unimodal counterparts on the same task because of the contextual information provided by the additional modalities.
For instance, an AI assistant that comprehends images, pricing information, and purchasing history is more likely to provide more relevant product recommendations than one that only comprehends pricing information.
The second reason is that multimodal models tend to be more computationally efficient, which means faster processing and (possibly) lower back-end costs. That is undoubtedly a plus for Microsoft, a profit-driven company.
So what about Florence? Because it comprehends language, video, and image modalities as well as the connections between them, it can perform tasks like measuring how similar two modalities are, or segmenting objects in a picture and pasting them onto a different background.
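The cross-modal similarity task works, broadly, by mapping images and text into one shared embedding space and measuring how close the resulting vectors are. The sketch below illustrates the idea with cosine similarity over toy vectors; the numbers are stand-ins for what Florence’s actual image and text encoders would produce, not real model outputs.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings standing in for encoder outputs (illustrative only).
image_embedding = [0.9, 0.1, 0.3]    # e.g., a photo of an apple
text_match = [0.8, 0.2, 0.4]         # e.g., the caption "a Honeycrisp apple"
text_mismatch = [-0.5, 0.9, -0.1]    # e.g., the caption "a city skyline"

# The matching caption scores higher than the unrelated one.
print(cosine_similarity(image_embedding, text_match) >
      cosine_similarity(image_embedding, text_mismatch))  # True
```

Image retrieval is the same idea at scale: embed a text query once, then rank a library of precomputed image embeddings by this score.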
In light of ongoing legal disputes that may determine whether AI systems trained on copyrighted data, including images, are infringing on intellectual property holders’ rights, I thought it was important to ask Montgomery which data Microsoft used to train Florence. He wouldn’t go into detail other than to say that Florence uses data sources that were “responsibly obtained,” including data from partners.
Additionally, Montgomery noted that Florence’s training data had been cleaned of any potentially offensive material, which is a problem with many open-source training datasets.
“When using large foundational models, it is crucial to ensure the training dataset’s quality in order to lay the groundwork for customized models for each vision task,” Montgomery said. “Additionally, the customized models for each Vision task have undergone testing for fairness, adversarial and challenging cases, and they implement the same content moderation services as we have been using for DALL-E and Azure OpenAI Service.”
We’ll have to take the company’s word for it. Some customers, it appears, already have. According to Montgomery, Reddit will use the new Florence-powered APIs to generate captions for images on its platform, creating “alt text” so users with vision impairments can follow along in threads more easily.
“Florence’s ability to generate up to 10,000 tags per image will give Reddit much more control over how many objects in a picture they can identify and help generate much better captions,” Montgomery said. “Reddit will also use captioning to help all users improve article ranking for searching for posts.”
Microsoft also uses Florence across a swath of its platforms, products and services.
On LinkedIn, as on Reddit, Florence-powered services will generate editable captions to support alt-text image descriptions. In Microsoft Teams, Florence is driving video segmentation capabilities. PowerPoint, Outlook and Word leverage Florence’s image captioning abilities for automatic alt text generation. And Designer and OneDrive, courtesy of Florence, have gained better image tagging, image search and background generation.
Montgomery sees Florence being used by customers for much more down the line, like detecting defects in manufacturing and enabling self-checkout in retail stores. I’d note that none of those use cases requires a multimodal vision model. But Montgomery asserts that multimodality adds something valuable to the equation.
“Florence is a complete re-thinking of vision models,” Montgomery said. “Once there’s the easy and high-quality translation between images and text, a world of possibilities opens up. Customers will experience significantly improved image search, train image and vision models and other model types like language and speech into entirely new types of applications and easily improve the quality of their customized versions.”