Identifying Feature Launch Dates Using Gini Impurity
Jon Simon, 7 Jul 2018
In theory a feature update is a very black-and-white event: before the update all requests use the old version of the feature, and after the update all requests use the new version of the feature. Unfortunately the real world is not so tidy, and in our data we often observe requests which utilize a new feature prior to its official release (e.g. when the feature was undergoing pre-launch testing) as well as requests which fail to utilize a new feature even after its official release (e.g. legacy recurring-work tasks).
Figure 1. Example of a request feature launch occurring in late May.
From an analytics perspective, this makes it very difficult to analyze the before/after effects of a given feature release, since these edge cases mean that the release boundary is fuzzy, and consequently difficult to programatically identify. Thankfully, there exists a simple mathematical formalism from machine learning which is perfectly suited to pinpointing these fuzzy change-over points: Gini impurity.
Gini impurity is a measure of the homogeneity of a set of labels, and most commonly arises in the context of decision tree learning where it's used to decide whether or not to split on a given dimension.
Formally, for a set of n items having k distinct labels, Gini impurity is computed as:
where ci is the number of items having label i
This can be understood as the probability that we misclassify an item in the set, assuming that we randomly assign labels to items according to the set-wide label distribution. The Gini impurity attains a minimum of 0 if all items have the same label, and attains a maximum of 1‑1/k if all k labels occur in equal numbers. A plot of this function when k=2 is shown below, alongside two other measures of label homogeneity:
Figure 2. Illustration of how GI, and related measures, change as a function of class label homogeneity
Our problem of identifying when a feature launch occurred is another such two-class situation, where the two label classes are (1) requests which use the old feature version, and (2) requests which use the new feature version. To identify when a feature launch occurred, we look for the timepoint such that GIbefore+GIafter is minimized. For the feature release shown in Figure 1, we can tell from eyeballing the plot that it was launched around May 25th, and superimposing the GI value curve on top of this data, we see that this is precisely where the minimum is attained.
Figure 3. Computing GI as described above for each timepoint, we find that GIbefore+GIafter is minimized precisely when the feature launch occurred.