Vokenization: Language that AI can see

With images now being incorporated into the training of artificial intelligence (AI) systems, robots might soon be able to “see” commands.

Author:
Quantumrun Foresight
May 9, 2023

Natural language processing (NLP) has enabled artificial intelligence (AI) systems to learn human language by interpreting words and matching their context with sentiment. One major downside is that these NLP systems are purely text-based. Vokenization is about to change that.

Vokenization context

Two text-based machine learning (ML) models are often used to process and understand human language: OpenAI’s Generative Pre-trained Transformer 3 (GPT-3) and Google's BERT (Bidirectional Encoder Representations from Transformers). In AI terminology, the units of text used in NLP training, typically words or pieces of words, are called tokens. Researchers from the University of North Carolina (UNC) observed that text-only models are limited because they cannot "see," meaning they cannot capture visual information and communication.

For example, if someone asks GPT-3 what color a sheep is, the system will often answer "black," even though sheep are typically white. This happens because a text-based system associates the word with the common phrase "black sheep" rather than with what a sheep actually looks like. By pairing tokens with relevant images (each pairing is called a visualized token, or voken), AI systems can gain a more holistic understanding of terms. Vokenization integrates vokens into self-supervised NLP systems, allowing them to develop "common sense."
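The "black sheep" effect can be illustrated with a toy sketch. The corpus below is invented purely for illustration, and counting adjacent words is a deliberately crude stand-in for what a text-only model picks up from its training data; it is not how GPT-3 works internally. The function name color_counts_before is hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus: the idiom "black sheep" appears more often
# than any literal description of a sheep's color.
CORPUS = [
    "he is the black sheep of the family",
    "every team has a black sheep",
    "she felt like the black sheep at work",
    "the white sheep grazed on the hill",
]

COLORS = {"black", "white", "brown"}


def color_counts_before(word, sentences):
    """Count color words that appear immediately before `word`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        # Walk over adjacent token pairs (previous word, current word).
        for prev, cur in zip(tokens, tokens[1:]):
            if cur == word and prev in COLORS:
                counts[prev] += 1
    return counts


counts = color_counts_before("sheep", CORPUS)
# Text statistics alone favor the idiom, not the animal's real color.
most_common_color = counts.most_common(1)[0][0]
print(counts)
print(most_common_color)
```

In this toy corpus, text statistics alone point to "black" because the idiom dominates, which is the kind of bias that pairing tokens with images is meant to correct.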

Integrating language models and computer vision is not a new concept, but it is a rapidly expanding field in AI research. The combination of these two types of AI leverages their individual strengths. Language models like GPT-3 are trained through unsupervised learning, which allows them to scale easily. In contrast, image models like object recognition systems can learn directly from reality and do not rely on the abstraction that text provides. For example, image models can recognize that a sheep is white by looking at a picture.

Disruptive impact

The process of vokenization is relatively straightforward. Vokens are created by assigning corresponding or relevant images to language tokens. An algorithm called a vokenizer is then trained to generate vokens through unsupervised learning, meaning it learns the matching from data rather than from explicit, hand-written rules. Common-sense AI systems trained through vokenization can communicate and solve problems better because they have a more in-depth understanding of context. The approach is unique because it predicts image tokens as well as language tokens, which traditional BERT models cannot do.
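The idea of assigning a relevant image to each token can be sketched as a nearest-neighbor lookup. This is a minimal illustration under stated assumptions, not the UNC researchers' implementation: the three-dimensional vectors, the image file names, and the functions cosine and vokenize are all hypothetical, and the sketch ignores sentence context, which a trained vokenizer would likely take into account.

```python
import math

# Hypothetical, hand-made embeddings. A real vokenizer would learn
# token and image representations from data instead.
TOKEN_VECTORS = {
    "sheep": [0.9, 0.1, 0.0],
    "grass": [0.1, 0.9, 0.0],
    "sky": [0.0, 0.1, 0.9],
}
IMAGE_VECTORS = {
    "white_sheep.jpg": [0.8, 0.2, 0.1],
    "green_field.jpg": [0.2, 0.8, 0.1],
    "blue_sky.jpg": [0.1, 0.0, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def vokenize(tokens):
    """Pair each token with its most similar image (its voken)."""
    pairs = []
    for token in tokens:
        vec = TOKEN_VECTORS[token]
        best_image = max(
            IMAGE_VECTORS,
            key=lambda name: cosine(vec, IMAGE_VECTORS[name]),
        )
        pairs.append((token, best_image))
    return pairs


pairs = vokenize(["sheep", "grass", "sky"])
for token, image in pairs:
    print(token, "->", image)
```

During training, a language model could then be asked to predict both the next or masked token and the voken paired with it, which is the extra visual signal the article describes.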

For example, robotic assistants will be able to recognize images and navigate processes better because they can “see” what is required of them. Artificial intelligence systems trained to write content will be able to craft articles that sound more human, with ideas that flow better, instead of disjointed sentences. Considering the wide reach of NLP applications, vokenization can lead to better-performing chatbots, virtual assistants, online medical diagnoses, digital translators, and more.

Additionally, the combination of vision and language learning is gaining popularity in medical imaging applications, specifically for automated medical image diagnosis. For example, some researchers are experimenting with this approach on radiograph images paired with text descriptions, a setting where semantic segmentation can be time-consuming. Vokenization could enrich the learned image representations and improve automated medical imaging by making use of the accompanying text.

Applications for vokenization

Some applications for vokenization may include:

Intuitive chatbots that can process screenshots, pictures, and website content. Customer support chatbots, in particular, may be able to accurately recommend products and services.

Digital translators that can process images and videos and provide an accurate translation that considers cultural and situational context.

Social media bot scanners that can conduct a more holistic sentiment analysis by merging images, captions, and comments. This application can be useful in content moderation that requires the analysis of harmful images.