Quantumrun

Synthetic data: Creating accurate AI systems using manufactured models

Simulated data generated by algorithms is increasingly used to build accurate artificial intelligence (AI) models.

Author:
Author name
Quantumrun Foresight
May 4, 2022

Insight summary

Synthetic data, a powerful tool that has applications ranging from healthcare to retail, is reshaping the way AI systems are developed and implemented. By enabling the creation of diverse and complex datasets without endangering sensitive information, synthetic data is enhancing efficiency across industries, preserving privacy, and reducing costs. However, it also presents challenges, such as potential misuse in creating deceptive media, environmental concerns related to energy consumption, and shifts in labor market dynamics that need to be carefully managed.

Synthetic data context

For decades, synthetic data has existed in different forms. It may be found in computer games like flight simulators and in physics simulations that depict everything from atoms to galaxies. Now, synthetic data is being applied within industries such as healthcare to solve real-world AI challenges.

The advancement of AI continues to run into several implementation obstacles. For example, AI models require large data sets that deliver trustworthy findings, are free of bias, and comply with increasingly strict data privacy regulations. Amid these challenges, annotated data created by computer simulations or programs has emerged as an alternative to real-world data. This algorithmically generated data, known as synthetic data, can help resolve privacy concerns and reduce bias because it can be designed to provide data diversity that reflects the real world.

In healthcare, practitioners use synthetic data in medical imaging to train AI systems while maintaining patient confidentiality. The virtual care firm Curai, for instance, used 400,000 synthetic medical cases to train a diagnosis algorithm. Retailers such as Caper use 3D simulations to create a synthetic dataset of a thousand photographs from as few as five product shots. According to a Gartner study on synthetic data released in June 2021, most of the data used in AI development will be artificially generated by rules, statistical models, simulations, or other techniques by 2030.

Disruptive impact

Synthetic data helps preserve privacy and prevent data breaches. For example, a hospital or corporation may offer a developer high-quality synthetic medical data to train an AI-based cancer diagnosis system, data that is as complex as the real-world data this system is meant to interpret. In this way, the developers have quality datasets to use when designing and building the system, and the hospital network does not risk exposing sensitive patient medical data.
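As a rough illustration of the hospital scenario above, the following minimal Python sketch generates synthetic records that share summary statistics with a source table without copying any source row. Everything in it is an assumption made for illustration: the column names, the toy values (invented, not real patient data), and the helper functions are hypothetical. It samples each column independently from a normal distribution, which discards correlations between columns that a production-grade generator would need to preserve.

```python
import random
import statistics
from typing import Dict, List, Tuple

Record = Dict[str, float]

# Invented toy values standing in for a sensitive source table.
SOURCE_RECORDS: List[Record] = [
    {"age": 34.0, "systolic_bp": 118.0},
    {"age": 51.0, "systolic_bp": 131.0},
    {"age": 47.0, "systolic_bp": 126.0},
    {"age": 62.0, "systolic_bp": 142.0},
    {"age": 29.0, "systolic_bp": 112.0},
]


def fit_column_stats(records: List[Record]) -> Dict[str, Tuple[float, float]]:
    """Compute the mean and sample standard deviation of every column."""
    stats = {}
    for column in records[0]:
        values = [record[column] for record in records]
        stats[column] = (statistics.mean(values), statistics.stdev(values))
    return stats


def sample_synthetic(
    stats: Dict[str, Tuple[float, float]], count: int, seed: int = 0
) -> List[Record]:
    """Draw `count` synthetic records, one independent Gaussian per column."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    return [
        {column: rng.gauss(mean, std) for column, (mean, std) in stats.items()}
        for _ in range(count)
    ]


stats = fit_column_stats(SOURCE_RECORDS)
synthetic_records = sample_synthetic(stats, count=5000)
```

Only the fitted statistics, not the source rows, are needed at generation time, which is the property that lets the data holder share the output more freely. Whether a given synthetic dataset is actually safe to share depends on the generator and the data, so this sketch should not be read as a privacy guarantee.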

Synthetic data can also let buyers of testing data access information at a lower price than traditional services offer. According to Paul Walborsky, who co-founded A.I. Reverie, one of the first dedicated synthetic data businesses, a single image that costs $6 from a labeling service can be artificially generated for six cents, a hundredfold reduction. In addition, synthetic data will pave the way for augmented data, which entails adding new data to an existing real-world dataset. For example, developers could rotate or brighten an existing image to make a new one.
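The rotate-or-brighten idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any vendor's pipeline: the image is represented as a plain list of rows of grayscale pixel values (0 to 255), and the function names and the brightness offset of 40 are hypothetical choices.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels, 0-255, stored row by row


def rotate_90_clockwise(image: Image) -> Image:
    """Return a new image rotated a quarter turn clockwise."""
    # Reversing the row order and then transposing yields a clockwise rotation.
    return [list(row) for row in zip(*image[::-1])]


def brighten(image: Image, delta: int) -> Image:
    """Return a new image with each pixel raised by delta, clamped to 0-255."""
    return [[max(0, min(255, pixel + delta)) for pixel in row] for row in image]


def augment(image: Image) -> List[Image]:
    """Derive two additional training images from one original."""
    return [rotate_90_clockwise(image), brighten(image, 40)]


original = [
    [10, 20],
    [30, 250],
]
augmented = augment(original)
```

Each derived image keeps the label of its original, so one labeled photograph yields several training examples at almost no extra cost.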

Lastly, given privacy concerns and government restrictions, the use of personal information stored in databases is becoming increasingly regulated and complex, making it harder to use real-world information to create new programs and platforms. Synthetic data could give developers a workaround by standing in for highly sensitive data.

Implications of synthetic data

Wider implications of synthetic data may include:

The accelerated development of new AI systems, both in scale and diversity, that improve processes in numerous industries and disciplines, leading to enhanced efficiency in sectors like healthcare, transportation, and finance.

Enabling organizations to share information more openly and teams to collaborate and operate more efficiently, leading to a more cohesive work environment and the ability to tackle complex projects with ease.

Developers and data professionals being able to email or carry large synthetic data sets on their laptops, safe in the knowledge that critical data is not being endangered, leading to more flexible and secure work conditions.

The reduced frequency of database cybersecurity breaches, as authentic data will no longer need to be accessed or shared as often, leading to a more secure digital environment for businesses and individuals alike.

Governments gaining more freedom to implement stricter data management legislation without worrying about impeding industry development of AI systems, leading to a more regulated and transparent data usage landscape.

The potential for synthetic data to be used unethically in creating deepfakes or other manipulative media, leading to misinformation and erosion of trust in digital content.

A shift in labor market dynamics, with increased reliance on synthetic data potentially reducing the need for data collection roles, leading to job displacement in certain sectors.

The potential environmental impact of increased computational resources required to generate and manage synthetic data, leading to higher energy consumption and associated environmental concerns.