There are multiple ways to generate synthetic datasets in Python, depending on the context and type of data needed. Below are a few approaches:
Using scikit-learn's make_classification or make_regression functions to generate classification or regression datasets, respectively:
from sklearn.datasets import make_classification, make_regression
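# Generate a classification dataset with 1000 samples, 4 features, and 3 classes.
# n_informative=3 is required here: with the defaults (n_informative=2,
# n_clusters_per_class=2) scikit-learn raises a ValueError for 3 classes.
# n_redundant=1 keeps informative + redundant features within the 4 requested.
X_clf, y_clf = make_classification(n_samples=1000, n_features=4, n_informative=3, n_redundant=1, n_classes=3)
# Generate a regression dataset with 1000 samples and 3 features
X_reg, y_reg = make_regression(n_samples=1000, n_features=3)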
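For repeatable experiments it usually helps to fix the seed. The following is a minimal sketch, assuming a seed of 0 and a noise level of 10.0 (both arbitrary choices for illustration, not values from the original snippet); it generates both datasets and prints their shapes and class counts.

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression

# random_state fixes the seed so repeated runs return the same arrays.
X_clf, y_clf = make_classification(
    n_samples=1000,
    n_features=4,
    n_informative=3,  # enough informative features for 3 classes
    n_redundant=1,    # informative + redundant must not exceed n_features
    n_classes=3,
    random_state=0,   # assumed seed, any integer works
)

# noise adds Gaussian noise to the regression target (assumed value).
X_reg, y_reg = make_regression(
    n_samples=1000, n_features=3, noise=10.0, random_state=0
)

print(X_clf.shape, y_clf.shape)  # feature matrix and label vector shapes
print(np.bincount(y_clf))        # number of samples per class
print(X_reg.shape, y_reg.shape)
```

Passing the same random_state again should reproduce the same arrays, which makes downstream results easier to compare.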
Using numpy's random module to generate random arrays with desired shape and distribution:
import numpy as np
# Generate a random array of shape (1000, 5) drawn from a uniform distribution on [0, 1)
X_uniform = np.random.rand(1000, 5)
# Generate a random array of shape (1000, 2) drawn from a standard normal distribution
X_normal = np.random.normal(size=(1000, 2))
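NumPy also offers the newer Generator interface, which keeps the random state in an explicit object instead of a global one. A small sketch, assuming a seed of 42 and explicit distribution parameters chosen only for illustration:

```python
import numpy as np

# A seeded Generator gives reproducible draws without touching global state.
rng = np.random.default_rng(42)  # assumed seed

# Uniform samples on [0, 1), shape (1000, 5)
X_uniform = rng.uniform(low=0.0, high=1.0, size=(1000, 5))

# Normal samples with mean 0 and standard deviation 1, shape (1000, 2)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))

print(X_uniform.shape, X_normal.shape)
```

Creating another generator with the same seed and repeating the calls in the same order yields identical arrays.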
Using a third-party library such as Faker, together with pandas, to generate realistic-looking synthetic records in a specific format:
from faker import Faker
import pandas as pd
fake = Faker()
# Generate a DataFrame with 1000 rows and fake name, address, and job title columns
rows = [[fake.name(), fake.address(), fake.job()] for _ in range(1000)]
df = pd.DataFrame(rows, columns=['Name', 'Address', 'Job'])