MOSTLY AI Synthetic Data Platform

MOSTLY AI: Smart imputation for improved data quality

MOSTLY AI's smart imputation is designed for datasets with missing values. The example uses the UCI Adult Income Survey with age values deliberately removed to show where conventional imputation methods fall short. With smart imputation enabled in the settings, the platform generates a synthetic dataset in which the missing values are filled in. Comparing the synthetic data with the original shows a distribution that closely mirrors the ground truth dataset. Because the replacement values come from a generative AI model, they are both realistic and relevant, which improves data quality.
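To make the contrast concrete, here is a minimal sketch on made-up data. It is not MOSTLY AI's algorithm: the two-group population, the 30% removal rate, and the conditional draw (used here as a simple stand-in for model-based imputation) are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)

# Hypothetical toy data: age depends strongly on a group label.
rows = []
for _ in range(2000):
    group = random.choice(["student", "retired"])
    mean_age = 22 if group == "student" else 70
    rows.append({"group": group, "age": random.gauss(mean_age, 3)})

# Deliberately remove about 30% of the age values.
for row in rows:
    row["age_observed"] = None if random.random() < 0.3 else row["age"]

observed = [r["age_observed"] for r in rows if r["age_observed"] is not None]
overall_mean = statistics.mean(observed)

# Conventional approach: fill every gap with the overall mean.
mean_filled = [
    r["age_observed"] if r["age_observed"] is not None else overall_mean
    for r in rows
]

# Conditional stand-in: draw from observed ages of the same group.
by_group = {}
for r in rows:
    if r["age_observed"] is not None:
        by_group.setdefault(r["group"], []).append(r["age_observed"])

cond_filled = [
    r["age_observed"]
    if r["age_observed"] is not None
    else random.choice(by_group[r["group"]])
    for r in rows
]

# Compare the spread of each filled column with the ground truth.
true_sd = statistics.pstdev([r["age"] for r in rows])
mean_sd = statistics.pstdev(mean_filled)
cond_sd = statistics.pstdev(cond_filled)
print(f"truth={true_sd:.1f} mean-filled={mean_sd:.1f} conditional={cond_sd:.1f}")
```

Mean filling piles the imputed values at a single point between the two groups, which shrinks the spread of the column. The conditional draw uses the other attributes of the record, so the filled column stays closer to the ground truth.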

Top 5 points on MOSTLY AI

  • MOSTLY AI claims to offer the highest accuracy among synthetic data platforms in the market.
  • The platform provides users with a QA report during the training and generation process, detailing the accuracy in modeling distributions, relationships, and dataset properties compared to the original data.
  • The headline accuracy number gives an overall sense of how closely the synthetic data represents the source, calculated across each attribute in the dataset.
  • Accuracy is measured for individual distributions, pairs of variables in bivariate distributions, and sequential or time series data, with an additional measure called coherence for the latter.
  • MOSTLY AI’s QA report highlights major differences between the original and synthetic datasets using strong colors, serving as an indicator for potential model configuration adjustments.
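The article does not give the formula behind the accuracy number. As a rough sketch, the code below assumes that per-attribute accuracy is one minus the total variation distance between the original and synthetic category frequencies, and that the headline number is the plain average over attributes. The function names and the tiny dataset are hypothetical.

```python
from collections import Counter


def univariate_accuracy(original, synthetic):
    """Assumed definition: 1 - total variation distance of the frequencies."""
    p, q = Counter(original), Counter(synthetic)
    n_p, n_q = len(original), len(synthetic)
    categories = set(p) | set(q)
    tvd = 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in categories)
    return 1.0 - tvd


def overall_accuracy(original_cols, synthetic_cols):
    """Assumed aggregation: average the per-attribute accuracies."""
    scores = [
        univariate_accuracy(original_cols[name], synthetic_cols[name])
        for name in original_cols
    ]
    return sum(scores) / len(scores)


# Hypothetical two-column dataset.
original = {
    "education": ["HS", "HS", "BSc", "MSc"],
    "sex": ["F", "M", "F", "M"],
}
synthetic = {
    "education": ["HS", "BSc", "BSc", "MSc"],
    "sex": ["F", "M", "M", "F"],
}

print(univariate_accuracy(original["education"], synthetic["education"]))  # 0.75
print(univariate_accuracy(original["sex"], synthetic["sex"]))              # 1.0
print(overall_accuracy(original, synthetic))                               # 0.875
```

Under the same assumption, bivariate accuracy would apply the calculation to pairs of columns, with each pair of values treated as one category.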

AI-generated Synthetic Data vs. Mock Data

  1. AI-generated Synthetic Data Advantages:
    • Maintains correlations between variables, providing a more robust tool.
    • Preserves relationships crucial for downstream-driven consumption in data science and analytics.
    • Offers higher utility for use cases compared to mock data.
  2. Comparison Dataset Creation:
    • Used a US Census Income dataset within MOSTLY AI documentation.
    • Generated mock data using industry-standard methods like mean, standard deviations, and business logic for numerical and categorical columns.
  3. MOSTLY AI Synthetic Data Generation:
    • Utilized MOSTLY AI, an AI-generated synthetic data platform.
    • Produced synthetic data with a similar structure to the original dataset, maintaining accuracy and configurations.
  4. Correlation Matrix Analysis:
    • Created correlation matrices for the original, synthetic, and mock datasets, focusing on numerical columns for simplicity.
    • Color-coded matrices based on confidence levels to highlight relationships between variables.
  5. Comparison of Data Sets:
    • Original and synthetic data showed similar color-coded relationships, indicating effective correlation preservation.
    • Mock data demonstrated minimal correlations, emphasizing the significant difference in maintaining variable relationships between AI-generated synthetic data and mock data.
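The effect described in this comparison can be reproduced on a small scale. The sketch below uses invented data (a hypothetical hours and income relationship, not the census dataset) and builds mock data the way the list describes, from each column's mean and standard deviation alone.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


random.seed(42)
n = 5000

# Hypothetical "original" data: income depends partly on hours worked.
hours = [random.gauss(40, 8) for _ in range(n)]
income = [1000 * h + random.gauss(0, 5000) for h in hours]

# Mock data: each column sampled independently from its own mean and
# standard deviation, so nothing links the two columns.
mock_hours = [
    random.gauss(statistics.mean(hours), statistics.stdev(hours)) for _ in range(n)
]
mock_income = [
    random.gauss(statistics.mean(income), statistics.stdev(income)) for _ in range(n)
]

original_corr = pearson(hours, income)
mock_corr = pearson(mock_hours, mock_income)
print(f"original: {original_corr:.2f}  mock: {mock_corr:.2f}")
```

The mock columns match the original means and standard deviations, yet their correlation is close to zero, because per-column statistics carry no information about how the columns move together.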

The MOSTLY AI Synthetic Data Platform offers immediate access to functionalities that allow you to initiate synthetic data generation or retrieve previously generated synthetic data. A detailed overview of these features is provided below for your review.

MOSTLY AI - Home page overview
  • File Upload:
    • Navigate to the “Upload Files” tab, where you can effortlessly upload CSV or Parquet files by either dragging and dropping them or selecting them through browsing. This quick process allows you to promptly configure and initiate the generation of a synthetic dataset.
  • Connect to a Source:
    • Visit the “Connect to a Source” tab to instantly establish a connection to a new database or cloud bucket. This feature streamlines the process of linking the platform to external data sources.
  • Synthesize a Sample Dataset:
    • Explore the “Or Use Sample Data” section, where you can promptly initiate the synthesis of a dataset using any available datasets. Simply choose a dataset and commence the creation of a new synthetic dataset by clicking the “Start” button.
  • Last Six Generated Synthetic Datasets:
    • Review the most recent six synthetic datasets under “Existing Synthetic Datasets.” Each dataset card provides insights into the overall accuracy of the AI models that were utilized to generate the synthetic data. This feature offers a quick overview of the quality of the generated datasets.

Upload files

Use the Upload files tab on the left side to drag-and-drop CSV (.csv), TSV (.tsv), or Parquet (.parquet) files from which you want to generate synthetic data.

Note
You can upload only one table of data. If you have a table that is split into multiple files, you can drag and upload all files.

MOSTLY AI - Home page Upload files tab

For next steps on how to configure a synthetic dataset, see Configure synthetic datasets.

Connect to a source

To generate synthetic data from an existing database or files stored in a cloud bucket, simply choose the “Connect to a Source” tab. This grants you direct access to the “Create Connector” workflow, allowing you to seamlessly set up the connection and initiate the process of generating synthetic data from your chosen source.

MOSTLY AI - Home page Connect to a source tab

For next steps on how to configure a connector, see Connectors.

Synthesize a sample dataset

In the “Or Use Sample Data” section, you can quickly create synthetic data from one of the prepared datasets. Choose a dataset and click “Start” to proceed to the “Start Job” screen, where you can configure the settings for the synthetic dataset.

MOSTLY AI - Home page Use sample data

Below is a description of each of the sample datasets.

| Dataset | Description | More info |
| --- | --- | --- |
| UCI Adult dataset | The UCI Adult dataset, also known as the Census Income dataset, is a well-known dataset used in machine learning and statistics. It contains census data from 1994 and consists of 48,842 instances, each representing an individual. | Link |
| Bank Marketing | The Bank Marketing dataset is another well-known dataset used in machine learning and statistics. It is also known as the UCI Bank Marketing dataset because it is hosted by the University of California, Irvine. | Link |
| Online Shoppers | The Online Shoppers Purchasing Intention dataset is a popular dataset hosted by the University of California, Irvine. It contains information on the browsing and purchasing behavior of visitors to an online store over a period of one year (from May 2010 to May 2011). | Link |

Last six generated synthetic datasets

You can view a summary of the six most recent synthetic datasets in the “Existing Synthetic Datasets” section. Each dataset card displays the overall accuracy of the trained AI models that generated the synthetic data.

The overall accuracy is a combined statistical measure. For additional details on the QA Report and an in-depth understanding of how the accuracy score is computed, refer to the “Read the QA Report” section.

MOSTLY AI - Home page Existing synthetic datasets

What is synthetic data?

Synthetic data is artificially generated information designed to emulate the features of real-world data. Its distinguishing trait is that it does not correspond to actual entities in the real world, such as individuals, organizations, or institutions.

What is AI-generated synthetic data?

MOSTLY AI offers the ability to generate tabular synthetic data.

Tabular synthetic data

Original Data:

Original data refers to the authentic information that organizations collect, encompassing details about data subjects, events and time series data, or reference information.

Generative AI Model:

Upon providing your original data to MOSTLY AI, a dedicated AI model is trained for each of your data tables. Each AI model learns the intricate patterns, correlations, distributions, and dependencies specific to its assigned data table.

Synthetic Data Generation:

Using the trained AI model, MOSTLY AI generates each row of the synthetic data table through random draws. A generated row does not correspond to the row at the same position in the original table. Even so, the synthetic data table accurately represents the original data because it retains the original's patterns, correlations, distributions, and dependencies.

To maintain brevity in this conceptual introduction, the example focuses on explaining single-table synthetic data. For a deeper understanding, you can explore the intricacies of multi-table synthetic data as well.
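The train-then-draw idea for a single table can be sketched with a deliberately simple stand-in model. The code below is not MOSTLY AI's model: it assumes a hypothetical two-column table, learns the frequency of the first column and the frequency of the second column given the first, and then generates new rows by random draws from those learned frequencies.

```python
import random
from collections import Counter, defaultdict


def fit(rows):
    """Learn P(first column) and P(second column | first column) as counts."""
    first = Counter(a for a, _ in rows)
    second_given_first = defaultdict(Counter)
    for a, b in rows:
        second_given_first[a][b] += 1
    return first, second_given_first


def draw(counter, rng):
    """Draw one value at random, weighted by its learned count."""
    values = list(counter)
    weights = [counter[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


def generate(model, n, rng):
    """Generate n rows; none is tied to a row position in the original."""
    first, second_given_first = model
    out = []
    for _ in range(n):
        a = draw(first, rng)
        b = draw(second_given_first[a], rng)
        out.append((a, b))
    return out


# Hypothetical original table: (occupation, income band).
original_rows = (
    [("student", "low")] * 40
    + [("student", "high")] * 10
    + [("employed", "low")] * 15
    + [("employed", "high")] * 35
)

rng = random.Random(7)
model = fit(original_rows)
synthetic_rows = generate(model, 10_000, rng)

# The dependency between the columns carries over to the synthetic table.
employed = [b for a, b in synthetic_rows if a == "employed"]
share_high = employed.count("high") / len(employed)
print(f"P(high | employed): original=0.70 synthetic={share_high:.2f}")
```

A frequency table like this only works for a couple of low-cardinality columns. It serves here to show the two phases, learning the structure and then sampling from it, and says nothing about how a production model represents that structure or protects privacy.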

Why AI-generated synthetic data?

Organizations today collect large volumes of data, including personal information that must be safeguarded to protect the privacy of customers and business partners. Regulations such as GDPR protect private data, which makes it hard to release such data for broader analysis, testing, or sharing.

Anonymization is a long-established technique for unlocking data that contains private information, but it comes with several challenges. AI-generated synthetic data helps organizations address them. Here’s how:

  1. Protect Privacy:
    AI-generated synthetic data serves as a robust solution to protect the privacy of data subjects, ensuring compliance with stringent data protection regulations.
  2. Avoid Anonymization Pitfalls:
    Unlike the error-prone process of anonymizing data, AI-generated synthetic data offers a more reliable and automated alternative, eliminating the pitfalls associated with traditional anonymization methods.
  3. Automated Data Generation:
    Through the training of AI models that learn the specific characteristics of original data, organizations can automate the generation of synthetic data, streamlining the process and enhancing efficiency.
  4. High Accuracy:
    AI-generated synthetic data achieves high accuracy, making it a seamless “drop-in replacement” for the original data. This ensures reliability and consistency in various applications.
  5. Enhanced Intelligence with Smart Features:
    • Data Rebalancing:
      Achieve fairer distributions, reduce bias, and improve model accuracy through data rebalancing techniques applied to synthetic data.
    • Smart Imputation:
      Replace missing values with meaningful ones generated by AI models, enhancing the completeness and reliability of the synthetic dataset.
    • Generation Mood:
      Control the diversity of created synthetic data using features like generation mood, allowing organizations to tailor data creation, including boosting the generation of outliers and edge cases.

By leveraging AI-generated synthetic data, organizations not only fortify privacy measures but also overcome the limitations of traditional anonymization methods. The automated and intelligent nature of synthetic data generation ensures accuracy, efficiency, and adaptability to diverse data needs in a privacy-conscious environment.
