Unified learning processes: Self-supervised learning can finally become consistent

Image credit: iStock

Researchers have finally discovered a way to train algorithms through a single learning process, regardless of data type or format.
    • Quantumrun Foresight
    • February 7, 2023

    Deep neural networks have traditionally been good at identifying objects in photos and videos and at processing natural language. However, most research on self-supervised algorithms has concentrated on a single modality at a time, which can bias the resulting methods toward that modality.

    Unified learning processes context

    Through self-supervised learning, computers learn about their surroundings by examining raw data and constructing the meaning of images, audio recordings, or written words on their own. Machines that do not need manual instruction to discern pictures or comprehend spoken language are far more efficient to build. However, most self-supervised learning research focuses on a single modality rather than several, so researchers working on one modality often adopt a completely different strategy than those working on another.
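
    The toy example below sketches this idea in code: with no labels at all, a small network is trained by hiding part of each input and learning to fill in the hidden values from what remains. The model, sizes, and training loop are illustrative assumptions rather than any specific published system.

```python
# Minimal self-supervised sketch: the training signal comes from the data
# itself (reconstructing masked values), not from human-provided labels.
import torch
import torch.nn as nn

# Tiny encoder-style model; sizes are arbitrary for illustration.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 32))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    x = torch.randn(16, 32)                # a batch of unlabeled "observations"
    mask = torch.rand_like(x) < 0.25       # hide roughly a quarter of each input
    corrupted = x.masked_fill(mask, 0.0)   # the model only sees the unmasked part
    pred = model(corrupted)
    loss = ((pred - x)[mask] ** 2).mean()  # predict what was hidden
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```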

    For example, speech has no predefined vocabulary of speech units for a self-supervised task to predict, so several speech models include mechanisms that learn an inventory of such units. Computer vision researchers have approached the analogous problem by learning discrete visual tokens, regressing the raw input, or applying data augmentation. However, it is often difficult to tell whether these methods will be effective outside their original context.
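
    To make the contrast concrete, the short sketch below compares the two kinds of training targets mentioned above: discrete speech-unit indices scored with a classification loss versus raw pixel values scored with a regression loss. The tensors and unit counts are placeholder assumptions, not a published recipe.

```python
# Contrasting modality-specific pretraining targets (illustrative only).
import torch
import torch.nn.functional as F

batch, patch_dim, n_units = 8, 256, 100

# Speech-style objective: each masked frame is assigned one of n_units
# discrete "speech units" (e.g. from clustering) and trained with cross-entropy.
speech_logits = torch.randn(batch, n_units)
unit_targets = torch.randint(0, n_units, (batch,))
speech_loss = F.cross_entropy(speech_logits, unit_targets)

# Vision-style objective: regress the raw pixel values of masked image patches.
pixel_pred = torch.randn(batch, patch_dim)
pixel_target = torch.randn(batch, patch_dim)
vision_loss = F.mse_loss(pixel_pred, pixel_target)

print(f"speech-unit loss: {speech_loss.item():.3f}, pixel loss: {vision_loss.item():.3f}")
```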

    According to a 2022 Cornell University study, top theories on the biology of learning suggest that humans likely use similar processes to comprehend both visuals and language. Similarly, general neural network architectures have outperformed modality-specific counterparts. As such, in 2022, Meta introduced Data2vec, a system that uses a single algorithm to train a neural network to recognize images, text, or speech. 

    Disruptive impact

    Algorithms process images, text, and speech differently because they predict distinct units, such as pixels, visual tokens, words, or entries in a sound inventory. Because each algorithm is built around a particular modality, methods designed for different modalities continue to work differently from one another. Data2vec allows models to operate on varying input types by predicting representations of the data itself, namely the internal layers of a neural network, rather than modality-specific targets. With Data2vec, there is no need to predict visual tokens, words, or sounds.
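
    The following toy example sketches this representation-prediction idea: a student network sees a masked view of the input and learns to predict the internal representations that a teacher network, maintained as an exponential moving average of the student, produces from the unmasked input. The layer sizes, masking scheme, and EMA rate are assumptions for illustration rather than Meta's published configuration.

```python
# Rough sketch of representation prediction with a student network and an
# EMA teacher. The real Data2vec recipe uses Transformer encoders, masks
# spans of the input, averages several top teacher layers as the target,
# and applies the loss only at masked positions; this toy version keeps
# only the core idea.
import copy
import torch
import torch.nn as nn

student = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
teacher = copy.deepcopy(student)  # the teacher tracks the student via EMA
for p in teacher.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
ema_decay = 0.999

for step in range(100):
    x = torch.randn(16, 32)                   # any modality, once encoded as a tensor
    mask = torch.rand(16, 1) < 0.5            # hide part of the batch from the student

    with torch.no_grad():
        target = teacher(x)                   # representations of the full input
    pred = student(x.masked_fill(mask, 0.0))  # the student sees only a masked view

    loss = ((pred - target) ** 2).mean()      # predict representations, not tokens
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Update the teacher as an exponential moving average of the student.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(ema_decay).add_(ps, alpha=1 - ema_decay)
```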

    Data2vec indicates that a self-teaching algorithm can not only work well across multiple modalities but also often outperform more traditional methods. This capability could lead to wider use of self-supervised learning and bring us closer to AI systems that can teach themselves about complex topics, such as sports events or different ways of baking bread, from videos, articles, and audio recordings.

    In a 2022 paper published in Nature, researchers highlighted promising applications of self-supervised learning for developing models that use multimodal datasets. The study also discussed the challenges of collecting unbiased data to train such models, particularly in medicine and healthcare. With self-supervised learning, the team could teach machines using only unlabeled data, a strong starting point for tasks in medicine (and beyond) that involve predicting hidden information that cannot be clearly categorized. In the future, algorithms will be able to better recognize open-ended inputs and relate them to other datasets without human intervention.

    Implications of unified learning processes

    Wider implications of unified learning processes may include: 

    • Chatbots that can make recommendations and identify products based on screenshots and voice recordings.
    • Digital assistants that can simultaneously process visual and audio information, leading to more accurate services and responses.
    • Virtual characters and friends created in the metaverse that can learn by interacting with humans and eventually engage and converse with people in ways that feel increasingly lifelike. 
    • Smart appliances that can self-start based on audio and visual cues.
    • Enhanced autonomous vehicle capabilities that can accurately identify objects on the road and respond appropriately to police and ambulance sirens.
    • Better assistive technology that can help guide people with hearing or visual impairments, improving their independence and mobility.

    Questions to comment on

    • How else can this technology create more intuitive devices and digital assistants?
    • What are some other ways that multimodal AI can help you at work?

    Insight references

    The following popular and institutional links were referenced for this insight: