My Victory in the “Mostly AI” Synthetic Data Competition

The Mostly AI Prize wrapped up with a surprise win: an independent competitor took first place in both the FLAT and SEQUENTIAL synthetic data generation challenges.

The contest tasked entrants with creating synthetic datasets that match the original data’s statistics without copying any actual records. The FLAT challenge focused on 100,000 records with 80 columns. The SEQUENTIAL challenge targeted 20,000 sequences of data points, preserving event order and coherence.

The winner pivoted from juggling multiple models to focusing on smart post-processing. After using the Mostly AI SDK to generate an oversized pool of millions of rows, they ran three steps to select the final dataset:

  1. Iterative Proportional Fitting (IPF): select an oversized subset (125,000 rows), weighting candidates so that key bivariate column pairs match the original data.
  2. Greedy Trimming: drop the 25,000 worst-fitting samples to hit the target size (100,000).
  3. Iterative Refinement: swap poorly fitting rows for better matches from the leftover pool.
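
To make the IPF step more concrete, here is a minimal sketch of classic raking on toy data. It is not the winner's code: the helper names (`shares`, `ipf_weights`), the three binary columns, and the choice of column pairs are all assumptions made for illustration. The idea is to rescale per-row weights repeatedly until the weighted distribution of each targeted column pair matches the reference.

```python
from collections import Counter
from itertools import product

def shares(rows, cols, weights=None):
    """Weighted share of each value combination in the given columns."""
    if weights is None:
        weights = [1.0] * len(rows)
    total = sum(weights)
    out = Counter()
    for row, w in zip(rows, weights):
        out[tuple(row[c] for c in cols)] += w / total
    return out

def ipf_weights(pool, targets, n_iter=50):
    """Rake per-row weights until every targeted column pair matches."""
    weights = [1.0] * len(pool)
    for _ in range(n_iter):
        for cols, target in targets.items():
            current = shares(pool, cols, weights)
            new_weights = []
            for row, w in zip(pool, weights):
                key = tuple(row[c] for c in cols)
                # Scale the row by target share / current share of its cell.
                ratio = target.get(key, 0.0) / current[key] if current[key] > 0 else 0.0
                new_weights.append(w * ratio)
            weights = new_weights
    return weights

# Toy data: `reference` stands in for the original data (hypothetical values).
reference = [(0, 0, 0)] * 4 + [(0, 1, 1)] * 2 + [(1, 1, 0)] * 3 + [(1, 0, 1)]
pool = list(product([0, 1], repeat=3))  # uniform candidate pool
targets = {(0, 1): shares(reference, (0, 1)), (1, 2): shares(reference, (1, 2))}

weights = ipf_weights(pool, targets)
for cols, target in targets.items():
    fitted = shares(pool, cols, weights)
    print(cols, {k: round(fitted[k], 3) for k in sorted(target)})
```

In a pipeline like the one described, weights of this kind could then drive the selection of the oversized subset from the pool.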

This method bumped FLAT challenge accuracy from around 0.96 to a near-perfect 0.992. The SEQUENTIAL challenge used a similar approach but skipped IPF, focusing instead on sequence-level coherence and greedy swaps.
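
The greedy trimming and swap-based refinement can be sketched in a few lines as well. This is a simplified illustration under stated assumptions, not the competition code: each row is reduced to a single categorical value, the fit is measured with a plain L1 distance between value shares (the real competition metric is not reproduced here), and the function names are hypothetical.

```python
from collections import Counter

def l1_error(counts, target):
    """L1 distance between the selection's value shares and the target shares."""
    n = sum(counts.values())
    keys = set(counts) | set(target)
    return sum(abs(counts[k] / n - target.get(k, 0.0)) for k in keys)

def greedy_trim(selected, target, size):
    """Drop one row at a time, always the removal that lowers the error most."""
    selected = list(selected)
    counts = Counter(selected)
    while len(selected) > size:
        best_key, best_err = None, float("inf")
        for key in [k for k in counts if counts[k] > 0]:
            counts[key] -= 1                      # try removing one such row
            err = l1_error(counts, target)
            counts[key] += 1                      # undo the trial removal
            if err < best_err:
                best_key, best_err = key, err
        selected.remove(best_key)
        counts[best_key] -= 1
    return selected

def refine_by_swaps(selected, pool, target, max_rounds=100):
    """Swap a selected row for a leftover row while that reduces the error."""
    selected, pool = list(selected), list(pool)
    counts = Counter(selected)
    for _ in range(max_rounds):
        best, best_err = None, l1_error(counts, target)
        for out_key in set(selected):
            for in_key in set(pool) - {out_key}:
                counts[out_key] -= 1
                counts[in_key] += 1
                err = l1_error(counts, target)
                counts[out_key] += 1
                counts[in_key] -= 1
                if err < best_err - 1e-12:
                    best, best_err = (out_key, in_key), err
        if best is None:                          # no swap improves the fit
            break
        out_key, in_key = best
        selected.remove(out_key)
        pool.remove(in_key)
        selected.append(in_key)
        pool.append(out_key)
        counts[out_key] -= 1
        counts[in_key] += 1
    return selected

target = {"a": 0.5, "b": 0.3, "c": 0.2}           # toy target shares
oversized = ["a"] * 8 + ["b"] * 3 + ["c"] * 1     # 12 rows, want 10
leftover = ["b"] * 2 + ["c"] * 3

trimmed = greedy_trim(oversized, target, size=10)
final = refine_by_swaps(trimmed, leftover, target)
print(Counter(trimmed), Counter(final))
```

At competition scale, trying every removal and every swap this naively would be far too slow, which is presumably why the optimization work on the hot loops mattered so much.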

Heavy optimization was key. Using smaller data types and sparse matrices cut memory use, and compiling bottlenecks with Numba sped up critical loops by hundreds of times over pure Python.
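
As a rough illustration of two of these optimizations, the sketch below downcasts integer-coded columns to a smaller dtype and compiles a counting loop with Numba's `njit` decorator. The data, the 50-category assumption, and the `pair_counts` function are made up for the example. The fallback decorator is only there so the snippet still runs without Numba installed. Sparse matrices are not shown.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # fall back to plain Python if Numba is not installed
    def njit(func):
        return func

# Hypothetical category codes, stored as int64: 8 bytes per cell.
rng = np.random.default_rng(0)
codes = rng.integers(0, 50, size=(100_000, 80), dtype=np.int64)

# All values fit in 0..255, so one byte per cell is enough.
small = codes.astype(np.uint8)
print(codes.nbytes // 2**20, "MiB ->", small.nbytes // 2**20, "MiB")

@njit
def pair_counts(col_a, col_b, n_a, n_b):
    """Count co-occurrences of two integer-coded columns in a tight loop."""
    counts = np.zeros((n_a, n_b), dtype=np.int64)
    for i in range(col_a.shape[0]):
        counts[col_a[i], col_b[i]] += 1
    return counts

table = pair_counts(small[:, 0], small[:, 1], 50, 50)
print(table.shape, table.sum())
```

Explicit element-by-element loops like this one are slow in interpreted Python and are the kind of code that typically benefits most from compilation.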

The winner attributes the edge to targeted post-processing tailored to the competition metrics, with no extra fancy ML tricks needed.

“Even though ML models are getting increasingly stronger, I think that for most problems that Data Scientists are trying to solve, the secret ingredient is often not in the model. Of course, a strong model is an integral part of a solution, but the pre- and postprocessing are equally important.”

The full code and solution are open source at the prize’s GitHub repository.

The top-five leaderboards for each challenge are now public, showing tight competition but a clear winner with this post-processing edge.


Source: mostly-ai/the-prize-eval
