Revolutionizing Machine Learning Workflows: The Waze Journey with TFX and Vertex AI

Waze

Waze is the world’s largest community-based traffic and navigation app. It uses real-time data to help users circumvent literal and figurative bumps in the road. On top of mobile navigation, Waze offers a web platform, a carpool app, partnership services, an advertisement platform and more. Such a broad portfolio brings along diverse technological challenges and many different use cases.

ML @Waze

Waze relies on many ML solutions, including:

  • Predicting ETA
  • Matching Riders & Drivers (Carpool)
  • Serving The Right Ads

But getting solutions like these right, and making them “production grade”, is not easy. Such projects commonly require complex surrounding infrastructure to reach production, and hence multiple engineers (data scientists, software engineers and software reliability engineers) and a lot of time. Even more so when you mix in the Waze-y requirements like large-scale data, low-latency (real-time, actually) inference, diverse use cases, and a whole lot of geospatial data.

This is largely why adopting ML opportunistically led to a chaotic state at Waze. For us it manifested as:

  • Multiple ML frameworks – you name it (sklearn, xgboost, TensorFlow, fbprophet, Java PMML, hand made etc.)
  • ML & Ops disconnect – models & feature engineering embedded in (Java) backend servers by engineers with limited monitoring and validation capabilities
  • Semi-manual operations for training, validation and deployment
  • A hideously long development cycle from idea to production

Overall, data scientists ended up spending a lot of their time on ops and monitoring instead of focusing on the actual modelling and data processing. At a certain level of growth we decided to organize the chaos and invest heavily in automation and processes so we could scale faster, dramatically increasing velocity and quality by adopting a full cycle data science philosophy. In the new world we wanted to build, a single data scientist is able to close the product cycle from research to a production grade service.

Data scientists now directly contribute to production to maximize impact. They focus on modelling and data processing and get much of the infrastructure and ops work out-of-the-box. While we have not yet reached the end of this journey and fully realized the above vision, we feel that the effort laid out here was crucial in putting us on the right track.

Waze’s ML Stack

Translating the above philosophy to a tech spec, we were set on creating an easy, stable, automated and uniform way of building ML pipelines at Waze.

Deep diving into tech requirements we came up with the below criteria:

  • Simple — to understand, use, operate
  • Managed — no servers, no hardware, just code
  • Customizable — get the simple stuff for free, yet flexible enough to go crazy for the 5% that would require going outside the lines
  • Scalable — auto scalable data processing, training, inference
  • Pythonic — we need something production-ready, that works with most tools and code today and fits the standard data scientist. There are practically no other options than Python these days.

For the above reasons we’ve landed on TFX and the power of its built-in components to deliver these capabilities mostly out of the box.

It’s worth saying – Waze runs its tech stack on Google Cloud Platform (GCP).

As it happens, GCP offers a suite of tools called Vertex AI, and it is the ML infrastructure platform Waze is building on. While we use many of Vertex AI’s managed services, we will focus here on Vertex Pipelines, a framework for ML pipelines that helps us encapsulate the complexity and setup of TFX (or any pipeline).

Together with our data tech stack, the overall ML architecture at Waze (all managed, scaled, pythonic etc.) is as follows:

graph of ML architecture at Waze

Careful readers will notice the alleged caveat here – we go all in on TensorFlow.

TFX means TensorFlow (even though that’s not exactly true anymore, let’s assume it is).

It might be a little scary at first when you have many different use cases.

Fortunately, the TF ecosystem is rich and Waze has the merit of having large enough data that neural nets converge.

Since starting this we’ve yet to find a use case that TF magic does not solve as well as or better than other frameworks (and we are not talking about micro percentage points; we are not trying to win a Kaggle competition here but to get something to production).

Waze TFX

You might think that landing on TFX and Vertex Pipelines solved all our problems, but that’s not exactly true.

In order to make things truly simple we’ve had to write some “glue code” (integrating the various products in the above architecture diagram) and abstract away enough details so the common data scientist could use this stuff effectively and fast.

That resulted in:

  • Eliminating boilerplate
  • Hiding all common TFX components so data scientists only focus on feature engineering and modelling and get the entire pipeline for free
  • Generating a BigQuery-based train / eval split
  • Providing pre-implemented, optional common feature transforms (e.g. scaling, normalization, imputation)
  • Providing pre-implemented Keras models (e.g. DNN/RNN models, similar to TF Estimators but written in Keras and compatible with TFX)
  • Providing utility functions (e.g. TF columns preparation)
  • Providing a unit testing framework for tf.transform feature engineering code
  • Orchestrating and scheduling pipeline runs from Airflow using a Cloud Run instance with all TFX packages installed (without installing them on the Airflow Composer environment)
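
To make the BigQuery-based train / eval split a little more concrete, here is a minimal sketch of one common way to do it: hash a stable key column with BigQuery's FARM_FINGERPRINT and route rows by the hash bucket. This is an illustration only and not the waze-data-tfx implementation; the function name `build_split_query`, the key column `row_id` and the 20% eval share are all assumptions made for the example.

```python
def build_split_query(table: str, key_column: str, split: str,
                      eval_percent: int = 20) -> str:
    """Return a query selecting the deterministic train or eval slice of a table.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    if split not in ("train", "eval"):
        raise ValueError(f"unknown split: {split!r}")
    if not 0 < eval_percent < 100:
        raise ValueError("eval_percent must be between 1 and 99")
    # Hash each row's key into one of 100 buckets. The same key always lands
    # in the same bucket, so the split is stable across pipeline runs.
    bucket = f"ABS(MOD(FARM_FINGERPRINT(CAST({key_column} AS STRING)), 100))"
    # Buckets below the cutoff go to eval, the rest go to train.
    comparison = "<" if split == "eval" else ">="
    return (
        f"SELECT * FROM `{table}` "
        f"WHERE {bucket} {comparison} {eval_percent}"
    )


train_query = build_split_query(
    "tfx_examples.simple_template_data", "row_id", "train")
eval_query = build_split_query(
    "tfx_examples.simple_template_data", "row_id", "eval")
```

Because the split depends only on the key, re-running the pipeline on a growing table keeps previously seen rows on the same side of the split, which helps avoid leaking training rows into evaluation.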

We’ve put it all in an easy to use Python package called “waze-data-tfx”.

Pyramid chart showing levels of Waze data tfx

On top of that, we provided a super detailed walkthrough, usage guides and code templates to our data scientists, so the common DS workflow is: fork, change config, tweak the code a little, deploy.

For reference, this is what a simple waze-data-tfx pipeline looks like:

  1. Configuration
    _DATASET_NAME = 'tfx_examples'
    _TABLE_NAME = 'simple_template_data'
    
    _LABEL_KEY = 'label'
    _CATEGORICAL_INT_FEATURES = {
       "categorical_calculated": 2,
    }
    _DENSE_FLOAT_FEATURE_KEYS = ["numeric_feature1", "numeric_feature2"]
    _BUCKET_FEATURES = {
       "numeric_feature1": 5,
    }
    _VOCAB_FEATURES = {
       "categorical_feature": {
           'top_k': 5,
           'num_oov_buckets': 3
       }
    }
    
    _TRAIN_BATCH_SIZE = 128
    _EVAL_BATCH_SIZE = 128
    _NUM_EPOCHS = 250
    
    _TRAINING_ARGS = {
       'dnn_hidden_units': [6, 3],
       'optimizer': tf.keras.optimizers.Adam,
       'optimizer_kwargs': {
           'learning_rate': 0.01
       },
       'layer_activation': None,
       'metrics': ["Accuracy"]
    }
    
    _EVAL_METRIC_SPEC = create_metric_spec([
       mse_metric(upper_bound=25, absolute_change=1),
       accuracy_metric()
    ])
  2. Feature Engineering
    def preprocessing_fn(inputs):
       """tf.transform's callback function for preprocessing inputs.
    
       Args:
           inputs: map from feature keys to raw not-yet-transformed features.
    
       Returns:
           Map from string feature key to transformed feature operations.
       """
       outputs = features_transform(
           inputs=inputs,
           label_key=_LABEL_KEY,
           dense_features=_DENSE_FLOAT_FEATURE_KEYS,
           vocab_features=_VOCAB_FEATURES,
           bucket_features=_BUCKET_FEATURES,
       )
       return outputs
  3. Modelling
    def _build_keras_model(**training_args):
       """Build a keras model.
    
       Args:
           **training_args: keyword arguments such as those in _TRAINING_ARGS;
               only "dnn_hidden_units" ([int], the DNN layer sizes) is read here.
    
       Returns:
           A keras model
       """
       feature_columns = \
           prepare_feature_columns(
               dense_features=_DENSE_FLOAT_FEATURE_KEYS,
               vocab_features=_VOCAB_FEATURES,
               bucket_features=_BUCKET_FEATURES,
           )
    
       return _dnn_regressor(deep_columns=list(feature_columns.values()),
                             dnn_hidden_units=training_args.get(
                                 "dnn_hidden_units"),
                             dense_features=_DENSE_FLOAT_FEATURE_KEYS,
                             vocab_features=_VOCAB_FEATURES,
                             bucket_features=_BUCKET_FEATURES,
                             )
  4. Orchestration
    pipeline_run = WazeTFXPipelineOperator(
       dag=dag,
       task_id='pipeline_run',
       model_name='basic_pipeline_template',
       package=tfx_pipeline_basic,
       pipeline_project_id=EnvConfig.get_value('gcp-project-infra'),
       table_project_id=EnvConfig.get_value('gcp-project-infra'),
       project_utils_filename='utils.py',
       gcp_conn_id=gcp_conn_id,
       enable_pusher=True,
    )

Simple, right?
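
For readers less familiar with the feature settings in the configuration step, here is a plain Python sketch of what `top_k`, `num_oov_buckets` and a bucket count conventionally mean in the tf.transform world: keep the most frequent categories as a vocabulary, hash everything else into a few out-of-vocabulary (OOV) ids, and cut numeric values into quantile buckets. We are assuming that `features_transform` follows these conventions; the helper names below are hypothetical and the real transforms run inside tf.transform, not in pure Python.

```python
import statistics
import zlib
from collections import Counter


def build_vocab(values, top_k):
    """Keep the top_k most frequent values, most frequent first."""
    counts = Counter(values)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {value: index for index, (value, _) in enumerate(ordered[:top_k])}


def lookup(value, vocab, num_oov_buckets):
    """Map a value to its vocab id, or to one of the OOV ids after the vocab."""
    if value in vocab:
        return vocab[value]
    # Stable hash, so the same unseen value always gets the same OOV id.
    return len(vocab) + zlib.crc32(value.encode("utf-8")) % num_oov_buckets


def quantile_boundaries(values, num_buckets):
    """Return the num_buckets - 1 cut points that split values into quantiles."""
    return statistics.quantiles(values, n=num_buckets)


def bucketize(x, boundaries):
    """Return the index of the bucket x falls into (0 .. len(boundaries))."""
    return sum(1 for b in boundaries if x >= b)


categories = ["car", "car", "car", "bus", "bus", "bike"]
vocab = build_vocab(categories, top_k=2)             # "car" -> 0, "bus" -> 1
car_id = lookup("car", vocab, num_oov_buckets=3)
tram_id = lookup("tram", vocab, num_oov_buckets=3)   # one of the ids 2, 3, 4

boundaries = quantile_boundaries(list(range(1, 101)), num_buckets=5)
bucket_low = bucketize(1, boundaries)
bucket_high = bucketize(100, boundaries)
```

The point of the OOV buckets is that categories unseen at training time still map to a valid id at serving time instead of failing the lookup.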

When you commit a configuration file to the code base, it gets deployed and sets up continuous training and a full-blown pipeline, including all the TFX and Vertex AI magic such as data validation, transforms deployed to Dataflow, monitoring, etc.
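
With continuous training, something has to decide whether each freshly trained model is good enough to replace the one in production. The `_EVAL_METRIC_SPEC` in the configuration step hints at this. Below is a minimal sketch of such a gate for a lower-is-better metric like MSE, assuming that `upper_bound` caps the candidate's metric and `absolute_change` caps how much worse than the current baseline it may be. These semantics are our reading of the parameter names, and the class and function here are hypothetical stand-ins, not the package's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LowerIsBetterThreshold:
    """Hypothetical stand-in for a spec like mse_metric(upper_bound=25, absolute_change=1)."""
    upper_bound: float      # the candidate's metric must not exceed this value
    absolute_change: float  # the candidate may be at most this much worse than the baseline


def passes(candidate: float, baseline: Optional[float],
           threshold: LowerIsBetterThreshold) -> bool:
    """Return True if the candidate model clears both checks."""
    if candidate > threshold.upper_bound:
        return False
    if baseline is None:
        # First run: there is no deployed model to compare against.
        return True
    return candidate - baseline <= threshold.absolute_change


mse_gate = LowerIsBetterThreshold(upper_bound=25, absolute_change=1)
first_model_ok = passes(20.0, None, mse_gate)
regressed_model_ok = passes(20.0, 18.5, mse_gate)   # 1.5 worse than baseline
improved_model_ok = passes(17.0, 18.5, mse_gate)
```

Only a model that passes the gate would be handed to the push step (the `enable_pusher=True` flag in the orchestration snippet), so a bad training run does not silently replace a good model.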

Summary

We knew we were on to something good when one of our data scientists came back from a long leave and had to use this new framework for a use case. She said that she was able to spin up a full production-ready pipeline in hours, something that before her leave would have taken her weeks to do.

Going forward we have much planned that we want to bake into `waze-data-tfx`. A key advantage that we see in having this common infrastructure is that once a feature is added, everyone can enjoy it “for free”. For example, we plan on adding additional components to the pipeline, such as Infra Validator and Fairness Indicators. Once these are supported, every new or existing ML pipeline will get these components out-of-the-box, no extra code needed.

Additional improvements we are planning are around deployment. We wish to provide deployment quality assurance while automating as much as possible.

One approach we are currently exploring is canary deployments. A data scientist will simply need to configure an evaluation metric, and the framework (using Vertex Prediction traffic splitting capabilities and other continuous evaluation magic) would test the new model in production and gradually deploy or roll back according to the evaluated metrics.
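
The control loop behind such a canary rollout could look roughly like the sketch below. It is purely illustrative: the `evaluate` callback stands in for "set the traffic split, wait, and measure the configured metric", and the function name, traffic steps and threshold are assumptions for the example rather than anything Vertex AI or waze-data-tfx provides.

```python
from typing import Callable, List, Sequence


def canary_rollout(
    evaluate: Callable[[int], float],
    min_metric: float,
    steps: Sequence[int] = (5, 25, 50, 100),
) -> List[int]:
    """Walk a new model through increasing traffic shares.

    Returns the history of traffic percentages given to the new model.
    The history ends with 100 on full promotion or with 0 on rollback.
    """
    history: List[int] = []
    for percent in steps:
        history.append(percent)
        # Hypothetical hook: route `percent` of traffic to the new model and
        # return the evaluation metric observed at that share (higher is better).
        if evaluate(percent) < min_metric:
            history.append(0)  # roll back: the new model gets no traffic
            return history
    return history


# A model that stays healthy at every step is fully promoted.
healthy = canary_rollout(lambda percent: 0.9, min_metric=0.8)

# A model that degrades once it sees half the traffic is rolled back.
degrading = canary_rollout(
    lambda percent: 0.9 if percent < 50 else 0.5, min_metric=0.8)
```

Starting with a small traffic share limits how many users a bad model can affect before the metric check catches it.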