generate a synthetic dataset in Python

There are multiple ways to generate synthetic datasets in Python, depending on the context and type of data needed. Below are a few approaches:

  1. Using scikit-learn's make_classification or make_regression functions to generate classification or regression datasets, respectively:
main.py
from sklearn.datasets import make_classification, make_regression

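# Generate classification dataset with 1000 samples, 4 features, and 3 classes.
# n_informative=3 is set explicitly: with the defaults (2 informative features,
# 2 clusters per class), requesting 3 classes raises a ValueError.
X_clf, y_clf = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    n_classes=3, random_state=42,
)

# Generate regression dataset with 1000 samples and 3 features
X_reg, y_reg = make_regression(n_samples=1000, n_features=3, random_state=42)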
  2. Using NumPy's random module to generate random arrays with the desired shape and distribution:
main.py
import numpy as np

# Generate random array of shape (1000, 5), uniformly distributed over [0, 1)
X_uniform = np.random.rand(1000, 5)

# Generate random array of shape (1000, 2) from the standard normal distribution
X_normal = np.random.normal(size=(1000, 2))
222 chars
8 lines
  3. Using a third-party library such as Faker, together with pandas, to generate synthetic records in a specific format:
main.py
from faker import Faker
import pandas as pd

fake = Faker()

# Generate DataFrame with 1000 rows and fake name, address, and job title columns.
# Note: fake.address() typically returns a multi-line string.
df = pd.DataFrame(
    [[fake.name(), fake.address(), fake.job()] for _ in range(1000)],
    columns=['Name', 'Address', 'Job'],
)
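
The approaches above can also be made reproducible by fixing the random seed. The following is a minimal sketch, assuming NumPy's `default_rng` generator; the helper name `make_toy_dataset`, the seed value 42, and the choice of two Gaussian blobs centred at -1 and +1 are illustrative assumptions, not part of the snippets above:

```python
import numpy as np


def make_toy_dataset(n_samples=1000, n_features=2, seed=42):
    """Hypothetical helper: two Gaussian blobs with binary labels."""
    rng = np.random.default_rng(seed)

    # Split the samples between the two classes (any remainder goes to class 1)
    n0 = n_samples // 2
    n1 = n_samples - n0

    # Class 0 is centred at -1, class 1 at +1, both with unit variance
    X0 = rng.normal(loc=-1.0, scale=1.0, size=(n0, n_features))
    X1 = rng.normal(loc=1.0, scale=1.0, size=(n1, n_features))

    X = np.vstack([X0, X1])
    y = np.concatenate([np.zeros(n0, dtype=int), np.ones(n1, dtype=int)])

    # Shuffle the rows so the two classes are interleaved
    order = rng.permutation(n_samples)
    return X[order], y[order]


X, y = make_toy_dataset()
```

Calling `make_toy_dataset()` twice with the same seed should return identical arrays, which is convenient when a test or a tutorial needs stable data.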
