Don’t scrape bad data. Mock good data! Introducing Mojito.

Pierre Paci
3 min read · Jun 16, 2018
Not that kind of Mojito

Follow me on Twitter for more AI / Cloud / Tech related stuff!

In this machine learning age, one big problem still keeps us from unleashing a wonderful swarm of AI on the world. Four letters: DATA. Data is the new gold/oil/diamond/[insert precious thing], as pretty much the entire internet will tell you. But as soon as you start a project, you find out that gathering clean and valuable data is very difficult.

Of course, if you are a big company with years of data lake building behind you, it will not be that hard, but that is clearly not the case for everyone. Let’s take an example that happened to me recently. We were at a hackathon and we knew precisely what the data should look like to be useful. In fact, we knew how the product would be used after the hackathon, and so precisely what the inputs and outputs of the machine learning algorithms should be. Looking around us, we saw everyone starting to play with scrapy, or more manually with requests, or some similar Node libraries. And almost everyone went through the same three steps:

  1. Find a source of data
  2. Write code to scrape the data
  3. Write code to clean the data

And even after all that, it was not obvious that the data would lead to a useful machine learning classifier/regressor.

We took another route: we mocked our data, and for that we created Mojito! It’s not the first library promising to mock data, but Mojito focuses on generating data that is sound from a statistical point of view. This is mandatory if you need to train an AI on that data.

Mojito is built around statistical sampling to ensure a consistent distribution of your mocked data. That said, if you give Mojito correct rules, your mocked data should be indistinguishable from the real thing!

The framework is built around one core idea. Samples are what you observe, say, visits to your website. Samples are generated by events. An event describes what the generated samples should look like and the statistical rule behind each property of each sample.

[Code snippet: structure of an event]
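The original snippet is not reproduced here. As a stand-in, here is a minimal sketch of the idea in plain Python. It does not use Mojito's actual API: the `Event` class, its `rules` field and its `sample` method are hypothetical names, and the distributions are made up for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict


# Hypothetical stand-in for an event; this is NOT Mojito's API.
@dataclass
class Event:
    name: str
    # One statistical rule per property: a function that draws one value.
    rules: Dict[str, Callable[[random.Random], object]]

    def sample(self, rng: random.Random) -> dict:
        """Draw one sample by applying every property's rule."""
        return {prop: rule(rng) for prop, rule in self.rules.items()}


visit = Event(
    name="visit",
    rules={
        # Assumed rule: age is normally distributed, clipped to 13..90.
        "age": lambda rng: min(90, max(13, round(rng.gauss(35, 10)))),
        # Assumed rule: gender is a fair draw between two categories.
        "gender": lambda rng: rng.choice(["F", "M"]),
    },
)

rng = random.Random(42)  # seeded so that a run is reproducible
samples = [visit.sample(rng) for _ in range(1000)]
print(samples[0])
```

Keeping each rule as a small function means a property can follow any distribution without changing the event itself.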

In addition, events can be composed. Let’s say there are two peaks of visits on your website, one around 1 PM and the other around 8 PM. Each peak lasts about one hour. Each visitor is identified by their gender and age. People in the 1 PM peak tend to be younger. On top of that, there is background noise: some visitors may come at any time of day. How could we model this with Mojito?

[Code snippet: description of multiple events]
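Again, the original snippet is missing, so the following is only a rough sketch of how the three events could be written down as statistical rules in plain Python, not Mojito code. The `EVENTS` table, the `draw` and `sample` helpers, and every number in it (a standard deviation of 0.25 hours so that most of a peak fits in about one hour, the mean ages) are assumptions chosen to match the scenario above.

```python
import random

# Hypothetical event descriptions: (distribution, parameter 1, parameter 2).
# Every number below is an illustrative assumption.
EVENTS = {
    "lunch_peak":   {"hour": ("gauss", 13.0, 0.25), "age": ("gauss", 25, 5)},
    "evening_peak": {"hour": ("gauss", 20.0, 0.25), "age": ("gauss", 40, 10)},
    "background":   {"hour": ("uniform", 0.0, 24.0), "age": ("gauss", 35, 12)},
}


def draw(rule, rng):
    """Draw one value from a (kind, a, b) rule."""
    kind, a, b = rule
    if kind == "gauss":
        return rng.gauss(a, b)      # a = mean, b = standard deviation
    if kind == "uniform":
        return rng.uniform(a, b)    # a = lower bound, b = upper bound
    raise ValueError(f"unknown distribution: {kind}")


def sample(event_name, rng):
    """Generate one visit for the named event."""
    rules = EVENTS[event_name]
    return {
        "event": event_name,
        "hour": draw(rules["hour"], rng) % 24,            # wrap into [0, 24)
        "age": max(13, round(draw(rules["age"], rng))),   # assumed minimum age
        "gender": rng.choice(["F", "M"]),
    }


rng = random.Random(0)
for name in EVENTS:
    print(sample(name, rng))
```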

Now let’s compose those events and draw 1000 visits (samples) from each peak, and 200 from the background noise. We will only print the first 5 generated samples, for obvious reasons. Note that you can add a single generator to a composer if needed.

[Code snippet: how to compose and generate samples from events]
[Output: generated samples]
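Since the original snippet and its output are not available here, this is a hedged sketch of the composition step in plain Python. It is not Mojito's `EventComposer`: the `PLAN` list and the `compose` function are hypothetical, and the distribution parameters are the same kind of illustrative assumptions as before. Only the counts (1000 per peak, 200 for the background) come from the text.

```python
import random

rng = random.Random(7)  # seeded so that a run is reproducible

# Hypothetical plan: (label, hour sampler, mean age, number of samples).
PLAN = [
    ("lunch_peak",   lambda: rng.gauss(13, 0.25), 25, 1000),
    ("evening_peak", lambda: rng.gauss(20, 0.25), 40, 1000),
    ("background",   lambda: rng.uniform(0, 24),  35, 200),
]


def compose(plan):
    """Draw the requested number of samples per event and merge them."""
    samples = []
    for label, draw_hour, age_mean, count in plan:
        for _ in range(count):
            samples.append({
                "event": label,
                "hour": round(draw_hour() % 24, 2),
                "age": max(13, round(rng.gauss(age_mean, 8))),
                "gender": rng.choice(["F", "M"]),
            })
    rng.shuffle(samples)  # mix the events so they are not grouped by source
    return samples


dataset = compose(PLAN)
for row in dataset[:5]:  # only show the first 5 samples
    print(row)
```

Passing a plan with a single entry works the same way, which mirrors the note that a composer can hold just one generator.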

EventComposer has an option to directly output this dataset as a CSV file.
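The exact option is not shown here, so as a rough equivalent, here is how a list of generated samples could be written out with Python's standard `csv` module. The `write_csv` helper and the two example rows are hypothetical and only illustrate the output format.

```python
import csv
import io


def write_csv(samples, fileobj):
    """Write a list of same-keyed dicts as CSV, with a header row."""
    if not samples:
        return
    writer = csv.DictWriter(fileobj, fieldnames=list(samples[0].keys()))
    writer.writeheader()
    writer.writerows(samples)


# Made-up rows, for illustration only.
rows = [
    {"event": "lunch_peak", "hour": 13.1, "age": 24, "gender": "F"},
    {"event": "background", "hour": 3.7, "age": 41, "gender": "M"},
]

buffer = io.StringIO()  # in-memory file, so nothing is written to disk
write_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with `open("visits.csv", "w", newline="")`, where `visits.csv` is just a placeholder name.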

I hope you enjoyed this introduction to Mojito. Don’t hesitate to try it for any of your mocking needs!

Pierre Paci

Cloud Engineer. I specialize in Azure, Kubernetes, Helm, and Terraform. Deploy everything as code! Apart from that, various PoCs around gaming, AI, and crypto.