Learning To Rank with TensorFlow Ranking
Recently, I built my first Learning To Rank (LTR) system using the TensorFlow Ranking (TFR) library and Microsoft's public ranking dataset. I am writing this blog hoping to shorten the learning curve of building an LTR system. I will cover the steps I took throughout the project, the resources I found useful, and the mistakes I made (or at least the ones I remember).
To start, what is ranking, and why do we need it? Ranking systems sort documents by relevance to a specific query. One example is the personalized recommendations on your Amazon page, where the system ranks items based on prior interactions or purchases. Another is search engines (Google, Bing), where the system considers various factors, such as location, to find relevant documents. For example, say you are located in Arizona and search the term “Cardinals”. The Arizona Cardinals (NFL team) would be more relevant to you than to someone living in Maine, who would more likely be shown documents about the bird. Ranking systems have driven considerable improvements in search engine results and can boost a company’s end-user experience by tailoring the platform to each individual user.
Project time!
Data Download & Cleaning
Microsoft released two public datasets around 2010 that you can download directly, or you can run the following in Jupyter or on the command line.
!wget https://storage.googleapis.com/personalization-takehome/MSLR-WEB10K.zip
If you use the public datasets link, you will find two separate datasets. For this project, I used the smaller one, which consists of 10,000 queries.
The raw dataset contains five folders, each holding train, test, and validation text files. I started the data cleaning and integrity checks by importing a few files at random using modin.pandas. Modin is an excellent open-source tool for speeding up your pandas workflow, and I would highly recommend using it in your own projects. Below is an example output from one of the files.
The first thing I noticed when importing the data was that every value carries its column index as a prefix. For example, all of column one's data had the prefix “qid:”. Additionally, none of the columns are labeled. Therefore, I created a features.csv file that contains the column descriptions based on the documentation Microsoft provides (located at the public datasets link).
In total, I created three functions to format the data. The first function shown ties the others together: it loads the data, labels the columns generically, cleans the data, renames the columns based on the CSV file, and saves the result to a CSV. For further details on the remaining functions, you can visit the source code here.
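For reference, here is a minimal sketch of what that wrapper looks like. The file paths, the "feature_name" column in features.csv, and the output column names are assumptions, so treat this as an outline rather than the exact code from the repo.

```python
import pandas as pd


def clean_fold(raw_path: str, features_csv: str, out_path: str) -> pd.DataFrame:
    """Rough outline of the formatting step (paths and column names are assumed)."""
    # Raw MSLR rows look like: "2 qid:10 1:0.03 2:0.00 ... 136:0".
    df = pd.read_csv(raw_path, sep=r"\s+", header=None)
    # Strip the "qid:" / "1:" / "2:" prefixes, keeping only the values.
    df = df.apply(lambda col: col.astype(str).str.split(":").str[-1])
    # Rename the columns using the descriptions stored in features.csv.
    feature_names = pd.read_csv(features_csv)["feature_name"].tolist()
    df.columns = ["relevance", "qid"] + feature_names
    df = df.apply(pd.to_numeric, errors="coerce")
    df.to_csv(out_path, index=False)
    return df
```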
Below is an output of the cleaned data.
The next step was verifying the data's integrity. The goal is to confirm that the data is balanced across all of the splits. To check this, I created the function below to print out the basic statistics of each folder (output shown below).
This function loads the data, checks the value counts within each split, prints the percentage of data in each folder, checks for NaN values across the entire dataset, and calculates the percentage of each target class within the training dataset.
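A condensed sketch of that integrity check might look like the following; the split names and the "relevance" column are assumptions carried over from the cleaning sketch above.

```python
import pandas as pd


def dataset_stats(splits: dict, target: str = "relevance") -> None:
    """Print split sizes, NaN counts, and the target distribution (sketch)."""
    total = sum(len(df) for df in splits.values())
    for name, df in splits.items():
        print(f"{name}: {len(df):,} rows ({len(df) / total:.1%} of the data)")
        print(f"  columns containing NaNs: {int(df.isna().any().sum())}")
    # Class balance of the relevance label in the training split.
    print(splits["train"][target].value_counts(normalize=True).sort_index())
```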
Every folder had nearly identical statistics, which suggests that Microsoft used stratified sampling when creating the dataset. As seen above, the data is split 60/20/20 across the train/test/val sets, column 137 is missing 100% of its data, and there is a drastic class imbalance in our target variable. Since column 137 is missing all of its values, I removed it from the dataset.
The target for our model is the relevance label. Relevance is scaled between 0 and 4, with 0 being completely irrelevant. Below is a visualization of the class imbalance.
The class imbalance is an issue for the model, as only 0.007% of documents carry the highest relevance label. However, I think it is safe to assume that in the wild there will be significantly more irrelevant documents than relevant ones. Moving forward, I did not address the class imbalance in any way.
Building an LTR System
Ok, so the data has been cleaned, and some basic statistics are known. It is now research and decision time. How do you build LTR systems within the TFR library, and what format does the data need to be? What loss metrics achieve the best results?
My head hurts. Let us start wrapping our heads around these problems.
Diving into LTR system methods, there are three ways to tackle this problem: Pointwise, Pairwise, and Listwise. Below is a general overview of each approach:
- Pointwise is the simplest method and is not a true ranking solution. It looks at one document at a time and treats the problem like any other machine learning task, solved via classification or regression. It is not a true ranking solution because the algorithm focuses on getting close to a label, an objective that does not carry over to ordering documents; in short, it does not order relevant documents relative to one another. Pointwise is the original solution for ranking problems, and I assume this method is largely outdated in production systems.
- Pairwise looks at a pair of documents at a time, attempts to identify the optimal ordering for that pair, and compares it to the ground truth. The goal of the loss function is to limit the cases where a pair of results is in the incorrect order compared to the ground truth. Pairwise achieves better results in practice than Pointwise and is implemented in many cutting-edge algorithms. I found this Wikipedia page, which has a great breakdown of various models, timelines of when they were developed, and state-of-the-art designs.
- Listwise views an entire list of documents and tries to identify the optimal order. This approach either directly optimizes a measure such as Normalized Discounted Cumulative Gain (NDCG) (we will dive deeper into this below) or minimizes a loss function defined based on an understanding of the ranking problem. In my research, I found this method to achieve the best results, but it also has the highest complexity.
For a more in-depth explanation of these different methods and the math behind them, I highly recommend this Medium article by Francesco Casalegno.
After researching the architectures behind these systems, I decided to move forward with the Listwise approach. Given that these systems achieve the best results across multiple research papers, I figured the added complexity would only be a small bump in the road.
To utilize the TFR pipeline, you need to convert the data into TFRecords based on the selected approach. Since my approach is Listwise, the documents must be grouped by Query ID.
To start, the class loads the original text files using sklearn.datasets.load_svmlight_file. I found this function to substantially outperform pandas for this task and would highly recommend it for your own project.
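Loading one fold looks roughly like this (the Fold1 path is a placeholder for wherever you unzipped the data); passing query_id=True is what returns the qid array needed for Listwise grouping.

```python
from sklearn.datasets import load_svmlight_file

# Returns a sparse feature matrix, the relevance labels, and the query ids.
X, y, qid = load_svmlight_file("MSLR-WEB10K/Fold1/train.txt", query_id=True)
X = X.toarray()  # densify for easier iteration when building TFRecords
```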
The class below reads in the text files, concatenates them to build the dataset, and passes the data into the TFRecord function, which converts it to TFRecords.
The records are structured using tensorflow_serving.apis.input_pb2.ExampleListWithContext(), which allows for future integrations if needed. To elaborate, ExampleListWithContext structures the data in two parts: context and examples. The context could hold additional text features, such as a user's search query, which could lead to the implementation of TFR-BERT and give the model more predictive power. BERT is a pre-trained open-source Natural Language Processing model released by Google in 2018 that allows for fine-tuned, high-performing models trained on smaller datasets via transfer learning. Per the TFR website:
Instead of learning a BERT representation scored independently across <query, document> pairs, LTR models apply a ranking loss to jointly learn a BERT representation that maximizes the utility of the entire ranked list concerning the ground-truth labels.
For this project, the context is an empty dictionary and goes unused. I think it is best to build an architecture that allows for future growth with minimal changes, which is one reason I chose this format.
Below is the code used to perform the conversion. I start by instantiating an ExampleListWithContext (ELWC) object, adding a variable to track the current query_id, creating an array of feature names mapped from features.csv, and looping over the data. Inside the loop, each document's features are converted to a tf.train.Example and appended to the ELWC, grouped by query_id. Once the query_id changes, the accumulated examples are written out as a TFRecord, ready to be used in the pipeline.
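Here is a stripped-down sketch of that loop, assuming the data was loaded with load_svmlight_file as above. The feature and label names ("relevance" plus whatever is in features.csv) are assumptions and must match the feature specs used later in the pipeline.

```python
import tensorflow as tf
from tensorflow_serving.apis import input_pb2


def write_elwc(X, y, qid, feature_names, out_path):
    """Group per-document examples into ExampleListWithContext protos by query id."""
    with tf.io.TFRecordWriter(out_path) as writer:
        elwc, current_qid = input_pb2.ExampleListWithContext(), None
        for row, label, q in zip(X, y, qid):
            # A new query id means the previous list is complete: write it out.
            if current_qid is not None and q != current_qid:
                writer.write(elwc.SerializeToString())
                elwc = input_pb2.ExampleListWithContext()
            current_qid = q
            example = elwc.examples.add()  # one tf.train.Example per document
            example.features.feature["relevance"].int64_list.value.append(int(label))
            for name, value in zip(feature_names, row):
                example.features.feature[name].float_list.value.append(float(value))
        if current_qid is not None:  # flush the final query's list
            writer.write(elwc.SerializeToString())
```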
TFR Pipeline
Now for the meat and potatoes of the TFR library. TFR provides an outstanding pipeline package that ties together everything needed for modeling. The picture above shows the steps required to build a pipeline, which I will walk through below.
Note: The TFR library comes with example Jupyter notebooks for different applications. I found these helpful for guidance during this project; though far from copy and paste, they were worth the time investment.
Note: The first four sections are located within the tfr.keras.model section of the TFR library.
Define input_creator
To start, we build the input layer the model uses to ingest the data. The input_creator (TFR source code) returns two dictionaries (context and examples) that map to the corresponding tf.keras inputs.
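A sketch of that step, assuming 136 numeric document features named feature_1 … feature_136 and a "relevance" label; these names are placeholders and must match whatever you wrote into the TFRecords.

```python
import tensorflow as tf
import tensorflow_ranking as tfr

context_feature_spec = {}  # no query-level features in this project
example_feature_spec = {
    f"feature_{i}": tf.io.FixedLenFeature(shape=(1,), dtype=tf.float32, default_value=0.0)
    for i in range(1, 137)
}
label_spec = ("relevance",
              tf.io.FixedLenFeature(shape=(1,), dtype=tf.int64, default_value=-1))

# Builds the keras Input layers for both the context and example features.
input_creator = tfr.keras.model.FeatureSpecInputCreator(
    context_feature_spec, example_feature_spec)
```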
Define Preprocessor
There are various transformations that you can make when building an LTR system. The above code is sourced from TFR’s website, which implements a log1p transformation.
log1p(x) = log(1 + x)
Diving into this transformation: a standard log transformation reduces skew in a distribution while maintaining the dimensionality of the feature space. The downside is that computing log(1 + x) directly is inaccurate for very small values. For example, if x = 1e-100, applying the transformation naively returns 0, because 1 + x rounds to 1 in floating point.
log1p accomplishes the same task as a standard log transformation with a much higher degree of accuracy for small inputs. Continuing the example, applying log1p to x returns approximately 1e-100.
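A minimal version of that preprocessor, following the pattern from the TFR tutorials and reusing the example_feature_spec from the input_creator sketch; the sign trick simply keeps the transform defined for negative feature values.

```python
import tensorflow as tf
import tensorflow_ranking as tfr

# Apply log1p(|x|) * sign(x) to every document feature.
preprocessor = tfr.keras.model.PreprocessorWithSpec(
    {name: lambda t: tf.math.log1p(t * tf.sign(t)) * tf.sign(t)
     for name in example_feature_spec})
```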
During modeling, I added a transformer flag to the modeling function to allow for easy experimentation. After training the same model with and without the transformation, the log1p adjustment significantly improved the results. Based on this paper by Google Research, a mixture of transformations performs best (not attempted in this project), with log1p in second place.
Define Scorer
You specify the model’s hidden layers, activation functions, and additional keyword arguments within the scorer.
Note: All arguments are inherited from tfr.keras.layers.create_tower, with additional keyword arguments passed through to tf.keras.layers.Dense.
The TFR library ships three different scorers: DNN, GAM, and Univariate. As seen above, I used the DNNScorer, which has shown the best results though it is a black-box model. The GAMScorer tends to perform worse but allows for interpretability.
Per TFR’s website:
For some ranking systems, such as loan eligibility assessment, advertisement targeting, or guidance for medical treatment, transparency and explainability are critical considerations. Applying Generalized additive models (GAMs) with well-understood weighting factors can help your ranking model be more explainable and interpretable.
For this project, interpretability is not crucial; therefore, we do not need to sacrifice performance.
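For illustration, a DNNScorer along those lines; the layer sizes and activation here are placeholder choices, not the tuned values from my experiments.

```python
import tensorflow as tf
import tensorflow_ranking as tfr

# A feed-forward scoring tower that outputs one relevance score per document.
scorer = tfr.keras.model.DNNScorer(
    hidden_layer_dims=[64, 32, 16],
    output_units=1,
    activation=tf.nn.relu,
    use_batch_norm=True)
```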
Make Model_builder
Model_builder is the final step within the tfr.keras.model section of the library and is where we tie everything together before plugging it into the pipeline.
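A sketch of that wiring, using the input_creator, preprocessor, and scorer defined in the earlier sketches; the mask feature and model names are placeholders.

```python
import tensorflow_ranking as tfr

model_builder = tfr.keras.model.ModelBuilder(
    input_creator=input_creator,
    preprocessor=preprocessor,
    scorer=scorer,
    mask_feature_name="example_list_mask",  # marks padded documents in each list
    name="mslr_model")
```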
Now that we have built the model structure, it is time to dive into the pipeline, incorporate data, and train a model.
Specify DatasetHparams
Within DatasetHparams, you tell the pipeline where to find the data, the batch sizes for training, the maximum list size, and which reader to use when loading the data. The data paths and batch size are typical of most machine learning projects.
List size, on the other hand, was something I had not come across before this project. Essentially, list_size either restricts each list to a particular length or pads it to the set size. If left blank, lists are padded to match the length of the largest list in the training data. Additionally, there is an optional argument for the validation data's list size, which defaults to the list_size value if left blank.
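Roughly, the hyperparameters look like this; the TFRecord patterns and batch sizes are placeholders.

```python
import tensorflow as tf
import tensorflow_ranking as tfr

dataset_hparams = tfr.keras.pipeline.DatasetHparams(
    train_input_pattern="data/train/*.tfrecords",   # placeholder paths
    valid_input_pattern="data/vali/*.tfrecords",
    train_batch_size=128,
    valid_batch_size=128,
    list_size=None,  # left unset so lists are padded rather than truncated
    dataset_reader=tf.data.TFRecordDataset)
```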
Make Dataset Builder
After defining the parameters for the dataset, we need to build the dataset itself. As seen above, I moved forward with the SimpleDatasetBuilder since I am only dealing with a single label. However, alternative dataset builders exist for more complex datasets, which you can view here.
These parameters are pretty straightforward, as we are tying together our previous arguments. One new parameter is mask_feature_name. This argument adds a feature with the given name to the feature dictionary: a tf.bool tensor indicating whether each example is real or padding.
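Tying those pieces together looks roughly like this, reusing the feature specs, label_spec, and dataset_hparams from the earlier sketches.

```python
import tensorflow_ranking as tfr

dataset_builder = tfr.keras.pipeline.SimpleDatasetBuilder(
    context_feature_spec,
    example_feature_spec,
    mask_feature_name="example_list_mask",  # must match the ModelBuilder's mask name
    label_spec=label_spec,
    hparams=dataset_hparams)
```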
Specify Pipeline Hparams
Hopefully, at this point, you are starting to see the repetitive process of defining parameters for a task and using the library to build out that task. The final step is creating the pipeline, starting with the model’s hyperparameters.
Again, most of these arguments are identical to other machine learning projects. However, two arguments are worth calling out in the LTR library: model_dir and loss.
The model_dir argument is the path where TFR will save the model and training data, including TensorBoard files for training visualization.
As mentioned earlier in the article, I found NDCG to perform best when applied to ranking problems. The idea is that the NDCG score is maximized when search results are ordered from most relevant to least relevant: the further down the list a result appears, the more the metric “discounts” it, so it pays to have the best results up top. For a deeper dive into NDCG, you can view my Medium post here.
That said, multiple versions of the NDCG loss exist within the TFR library. As seen in the table at the end of this blog, GumbelApproxNDCGLoss achieved the best results.
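A sketch of those hyperparameters; the output directory, epoch counts, and learning rate are placeholders, while the loss string is the ranking-loss key corresponding to GumbelApproxNDCGLoss.

```python
import tensorflow_ranking as tfr

pipeline_hparams = tfr.keras.pipeline.PipelineHparams(
    model_dir="models/mslr_ltr",   # checkpoints and TensorBoard logs land here
    num_epochs=5,
    steps_per_epoch=1000,
    validation_steps=100,
    learning_rate=0.05,
    loss="gumbel_approx_ndcg_loss")
```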
Define Ranking Pipeline
Now for the easy part: all that’s left is to tie everything together and let TFR do the rest! Remember that your pipeline class will depend on your approach (abstract, multi-label, or simple).
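With a single label, the simple pipeline is the fit here; a sketch using the builders and hyperparameters defined above.

```python
import tensorflow_ranking as tfr

ranking_pipeline = tfr.keras.pipeline.SimplePipeline(
    model_builder,
    dataset_builder=dataset_builder,
    hparams=pipeline_hparams)
```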
Training
The final step is to train the model. Again, all training information will be stored in the output directory defined during the Pipeline Hparams step.
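Kicking off training is a single call on the pipeline object from the previous step.

```python
# Runs fit() with the configured datasets; checkpoints and TensorBoard
# logs are written under the model_dir set in PipelineHparams.
ranking_pipeline.train_and_validate(verbose=1)
```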
Final Evaluation
Based on this Google Research page, I was happy with how the model performed. My results are slightly lower than what they accomplished, which makes sense since they used the larger dataset from Microsoft.
I hope this blog post helped you learn about ranking systems and the overall process behind TensorFlow Ranking! If you have any questions, please don’t hesitate to reach out!