With companies and organizations producing more and more data, a large set of rich and interesting datasets has become available in recent years. In addition, some of these organizations are embracing the concept of open data, enabling the public dissemination and use of the data by any interested party. Recently, a request under the New York State Freedom of Information Law (FOIL) made available an extremely detailed dataset of New York City taxi trip records covering every taxi trip of 2013. This dataset captures various pieces of information about each individual taxi trip, including the pickup and drop-off locations, the time and duration of the trip, the distance travelled, and the fare amount. You’ll see that this data qualifies as real-world data, not only because of the way it has been generated but also in the way that it’s messy: there are missing data, spurious records, unimportant columns, baked-in biases, and so on. And speaking of data, there’s a lot of it! The full dataset is over 19 GB of CSV data, making it too large for many machine-learning implementations to handle on most systems.
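One common way to cope with a CSV file that dwarfs available memory is to process it in chunks. Here is a minimal sketch using pandas; the column names and the tiny in-memory sample are assumptions standing in for the real 19 GB trip file, chosen to mirror the fields described above:

```python
import io
import pandas as pd

# A tiny synthetic sample standing in for the multi-GB trip file.
# Column names are assumptions based on the fields described in the text.
sample_csv = io.StringIO(
    "pickup_datetime,trip_distance,fare_amount,tip_amount,payment_type\n"
    "2013-01-01 00:05:00,2.1,9.5,1.5,CRD\n"
    "2013-01-01 00:07:00,0.8,5.0,0.0,CSH\n"
    "2013-01-01 00:09:00,,52.0,10.0,CRD\n"  # missing distance: messy data
)

# Reading with chunksize keeps memory bounded even for files far larger
# than RAM; each chunk is an ordinary DataFrame.
total_rows = 0
for chunk in pd.read_csv(sample_csv, chunksize=2):
    chunk = chunk.dropna(subset=["trip_distance"])  # drop incomplete records
    total_rows += len(chunk)

print(total_rows)  # 2 of the 3 records survive the cleaning pass
```

In practice you would point `pd.read_csv` at the trip file on disk and accumulate whatever per-chunk statistics or filtered rows you need, rather than holding everything in memory at once.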
When working with machine learning, it’s critical to watch out for pitfalls: too-good-to-be-true scenarios and premature assumptions that are not rooted in the data. As a general rule in ML, if the cross-validated accuracy is higher than you’d have expected, chances are your model is cheating somewhere. The real world is creative when it comes to making your life as a data scientist difficult. When building our initial tip/no-tip classification models, we quickly obtained a very high cross-validated predictive accuracy. Because we were so excited about the model’s performance on this newly acquired dataset (we nailed it!), we temporarily ignored the warning signs of a cheating model. But having been bitten by such things many times before, we let the overly optimistic results prompt us to investigate further. One of the things we looked at was the importance of the input features (as you’ll see in more detail in later sections). In our case, a single feature totally dominated the model in terms of feature importance: payment type. From some taxi experience, this could make sense. People paying with credit cards (in the pre-Square era) may have a lower probability of tipping. If you pay with cash, you almost always round up to whatever you have the bills for. So we started segmenting the number of tips versus no-tips for people paying with a credit card rather than cash.
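The segmentation described above can be sketched as a simple group-by over the trip records. The toy DataFrame and the `CRD`/`CSH` payment-type codes here are illustrative assumptions standing in for the real data:

```python
import pandas as pd

# Hypothetical trip records; CRD/CSH stand in for the dataset's
# credit-card vs. cash payment-type codes.
trips = pd.DataFrame({
    "payment_type": ["CRD", "CRD", "CSH", "CSH", "CRD", "CSH"],
    "tip_amount":   [1.50,  0.00,  0.00,  0.00,  2.00,  0.00],
})

# Label each trip as tip/no-tip, then compute the tip rate per payment type.
trips["tipped"] = trips["tip_amount"] > 0
tip_rate = trips.groupby("payment_type")["tipped"].mean()
print(tip_rate)
```

If one payment type shows a wildly different tip rate than the other, that is exactly the kind of pattern a model can latch onto, and a cue to ask whether the signal reflects real behavior or an artifact of how the data was recorded.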