Data Science Report: Analysis of Instacart Data and Prediction Methods


Added on 2022/09/01

AI Summary

This report presents a data science analysis of Instacart's market basket data, focusing on predicting customer reorders. The assignment examines how a data scientist approached the problem using gradient boosting models, particularly XGBoost, and complex feature engineering to extract meaningful patterns from the data. The report highlights the importance of understanding temporal behavioural patterns and discusses why predicting reorders is harder than building a traditional recommendation system. The analysis includes an overview of the tools employed, such as NumPy, SciPy, and scikit-learn. It also points out the significance of specially modelling the competition's F1 evaluation metric to achieve better results. The report concludes by noting that larger datasets might yield even better results and that the described methods were effective for gaining insight into the problem.

Running head: APPLIED DATA SCIENCE AND ANALYTICS
APPLIED DATA SCIENCE AND ANALYTICS
NAME OF THE STUDENT
NAME OF THE UNIVERSITY

The blog post (https://medium.com/kaggle-blog/instacart-market-basket-analysis-feda2700cded)
describes a data science problem posed by Kaggle, which challenges users to predict which
grocery products an Instacart customer will purchase again and when. The post then takes the
reader through the workings of the solution by the competition's 2nd-place finisher.
Retail is a vast industry, with large companies operating chains across different countries.
As a retail chain grows from a modest size to a national or even international network, the
volume of information about its operations becomes enormous and can be used to make informed
recommendations for strategic decision making.
The problem asks when a consumer's stock of a particular product will run out and the
consumer will buy that product again. Predicting which products a customer might want to buy
is a job for a recommendation algorithm; this problem differs in that it relies on
understanding temporal behavioural patterns. Whereas Netflix might be fine assuming you want
to watch another movie similar to the one you just watched, it’s less clear that you’ll want
to reorder a fresh batch of almond butter or toilet paper if you bought them yesterday.
The goal of the problem is to correctly predict grocery reorders. The data provided contains
each user’s purchase history, and the aim is to predict which of their previously purchased
products they will reorder.
The problem was solved by the Japanese data scientist Kazuki Onodera, who used a mixture of
gradient boosted tree models, complex feature engineering, and a special modelling of the
competition’s F1 evaluation metric.
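To illustrate why the F1 metric may deserve its own modelling step, the sketch below chooses how many of a user's most probable items to predict so that the expected F1 score is as high as possible. It is a toy illustration and not Onodera's actual procedure: it assumes that items are reordered independently, that an empty prediction matched with an empty basket scores 1.0, and it enumerates every possible basket, which is only feasible for a handful of items. The function names and the probabilities are hypothetical.

```python
from itertools import product


def f1(predicted, actual):
    """F1 between two sets of items; assumed to be 1.0 when both are empty."""
    if not predicted and not actual:
        return 1.0
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


def expected_f1(predicted, probs):
    """Exact expected F1 of `predicted`, enumerating every possible basket.

    Assumes each item is reordered independently with probability probs[item].
    The cost is exponential in the number of items, so this is for tiny examples only.
    """
    items = list(probs)
    total = 0.0
    for outcome in product([0, 1], repeat=len(items)):
        weight = 1.0
        actual = set()
        for item, bought in zip(items, outcome):
            weight *= probs[item] if bought else 1 - probs[item]
            if bought:
                actual.add(item)
        total += weight * f1(predicted, actual)
    return total


def best_top_k(probs):
    """Return the top-k most probable items, with k chosen to maximise expected F1."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    candidates = [set(ranked[:k]) for k in range(len(ranked) + 1)]
    return max(candidates, key=lambda chosen_set: expected_f1(chosen_set, probs))


# Hypothetical reorder probabilities for one user.
probs = {"banana": 0.4, "milk": 0.4}
chosen = best_top_k(probs)  # {"banana", "milk"}
```

In this example a fixed 0.5 probability threshold would predict nothing, for an expected F1 of 0.36, whereas predicting both items gives 0.48. This is the kind of gap that modelling the metric directly is meant to close.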

The tools used to solve the problem include XGBoost, NumPy, SciPy and scikit-learn, which
covered the modelling, analysis and visualisation. XGBoost provides a gradient boosting
framework for several languages, including Java, Python and R.
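The idea behind gradient boosting can be sketched in a few lines of plain Python. The example below is not XGBoost and not the model used in the competition: it assumes a single numeric feature, a squared-error loss and one-split "stump" trees, whereas XGBoost adds regularisation, second-order gradient information and much more. The function names and the toy data are hypothetical. Each round fits a stump to the current residuals and adds a shrunken copy of it to the model.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on x that best fits the residuals (squared error).

    Requires at least two distinct values in xs.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.3):
    """Fit an additive model of stumps, each one trained on the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean of the targets
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, learning_rate


def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, stumps, learning_rate = model
    return base + sum(
        learning_rate * (left_value if x <= threshold else right_value)
        for threshold, left_value, right_value in stumps
    )


def mse(values, targets):
    """Mean squared error between predictions and targets."""
    return sum((v - t) ** 2 for v, t in zip(values, targets)) / len(targets)


# Hypothetical toy data: days since an item was last ordered, and whether it was reordered.
xs = [1, 2, 3, 5, 8, 13, 21, 30]
ys = [0, 0, 0, 1, 1, 1, 0, 0]

model = fit_boosted_stumps(xs, ys)
base_mse = mse([model[0]] * len(ys), ys)
model_mse = mse([predict(model, x) for x in xs], ys)
```

Comparing `model_mse` with `base_mse` shows how the successive stumps reduce the training error relative to always predicting the mean.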
After the data analysis, many otherwise unseen patterns emerged that helped in solving the
problem. For example, there was a noticeable pattern in the reordering history. It might be
intuitive to think that a user who has bought an item many times will likely buy it again.
However, there are instances where such an item is not reordered, and a pattern was found
that predicts when a user would not reorder an item.
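Patterns of this kind are usually exposed to a model through engineered features. The sketch below derives a few simple per-item features from one user's order history. The feature names and the data layout (a list of baskets, oldest first) are assumptions made for illustration and are not the features used in the winning solution.

```python
def product_features(orders):
    """Compute simple reorder features for every item one user has bought.

    orders: list of sets of item names for a single user, oldest order first.
    """
    n_orders = len(orders)
    features = {}
    for index, basket in enumerate(orders):
        for product in basket:
            f = features.setdefault(
                product,
                {"times_ordered": 0, "first_index": index, "last_index": index},
            )
            f["times_ordered"] += 1
            f["last_index"] = index
    for f in features.values():
        # Orders in which the item could have appeared, counted from its first purchase.
        chances = n_orders - f["first_index"]
        f["order_rate"] = f["times_ordered"] / chances
        # Number of orders placed since the item was last bought.
        f["orders_since_last"] = n_orders - 1 - f["last_index"]
        del f["first_index"], f["last_index"]
    return features


# Hypothetical history of four orders for one user.
orders = [
    {"milk", "banana"},
    {"banana"},
    {"banana", "toilet paper"},
    {"banana"},
]
feats = product_features(orders)
# feats["milk"] == {"times_ordered": 1, "order_rate": 0.25, "orders_since_last": 3}
```

An item with a high `order_rate` but a growing `orders_since_last` is one possible signal that a user has stopped reordering something they used to buy regularly.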
In the end, the methods used here to gain insight into the problem (complex feature
engineering, gradient boosting models and a special modelling of the competition’s F1
evaluation metric) yielded a 2nd-place finish. Had the dataset been larger, perhaps the
results would have been even better.

