Data Science Investigation for FluffyGroCo's Business


Added on  2022/10/10

AI Summary
FluffyGroCo has been impacted by the Crackety Crickling insect infestation which hardens Truffula leaves and makes them inappropriate for textile production. The investigation was conducted using EDA (Exploratory Data Analysis) and chi-test techniques. The dataset provides insights on different parameters such as rainy, temperature, field, and date. The investigation includes data cleaning, data manipulating, application of exploratory data analysis techniques, validating the given deterministic rule, validating the rule for Nextafoo and Uptagoo, and framing potential solutions. The potential solution includes the use of organic larvae solution and data analytics solution. Ethical and security considerations are also discussed.

Table of Contents
Assessment of FluffyGroCo’s business
Overview of investigation
Analysis and results
    Nominal data analysis
    Deterministic rule
    Validating rule
    Nextafoo
    Uptagoo
    Rondadoo
Ethical and security considerations
Potential solutions
    Technology Stack - Processing and Storage
References
Appendix
    Statistics and methodology

Assessment of FluffyGroCo’s business
FluffyGroCo has boosted the green economy by bringing Truffula trees back into plantation. It
harvests Truffula leaves for use in the organic textile industry. However, the business has
been hit by an infestation of the Crackety Crickling insect, which hardens Truffula leaves and
makes them unsuitable for textile production. FluffyGroCo has been looking for organic
solutions to the problem in order to avoid upsetting the balance of nature. Among the
alternatives, fostering bacteria production was a prominent option, but it has a few critical
issues. First, the timing of treatment is crucial: a slight change in timing may have a
significant adverse effect on the leaves. Second, the treatment is very expensive. Lastly,
excessive use of bacteria may threaten the survival of the Lazy Lossy Bears who feed off them.
To arrive at the best solution at minimum cost, the company has employed a team of
biologists, geologists and chemists to study the impact of the environmental conditions that
can lead to stunting of Crackety Crickling larvae. They studied different parameters and
concluded that the risk of stunting can be calculated with deterministic rules, i.e.
if-then-else scenarios. The company has also collected data on different environmental
conditions. This data contains the date on which each observation was collected, together
with field, infestation, rainy and temperature. To validate the data, it is matched against the
data supplied by the volunteer farmers; validation of data is the foundation of data science.
The collected data has 5479 entries, which can be used to draw conclusions for the given case
and fulfil the critical aims. The dataset provides the following insights:
Parameter “rainy” reflects that rain plays an important role in determining the humidity
level, which can foster or hinder larvae growth.
Parameter “temperature” shows at which values, in combination with rain, infestation can
result.
Parameter “field” shows which type of Truffula plantation is affected when the temperature
and rainy parameters have favourable values.
Parameter “date” gives the date on which the data was collected (Pierson and Porway,
2017).
Data science plays an important role in studying data from a statistical point of view. It is a
new and highly capable field that allows users to make prediction-based decisions faster and
more accurately, and it extends and enhances human perception and comprehension. Here it
will help in predicting the key factors responsible for infestation and in identifying potential
solutions that can be implemented without affecting the natural balance of the field.
The data has a date column, which is an important field in data science. It can show which
month, day or year has the maximum infestation, together with the corresponding values of
rain and temperature. The investigation includes data cleaning, data manipulation,
application of exploratory data analysis (EDA) techniques to explore the data, validation of
the given deterministic rule, validation of the rule for Nextafoo and Uptagoo, and finally the
framing of potential solutions. The EDA includes studying relationships between variables,
generating plots using bar charts or pivot tables, and generating summary statistics. The
dataset has three columns of nominal values, i.e. field, rainy and infestation. To study the
relationships among them and with the other columns, the chi test and correlation techniques
are used.
Overview of investigation
The investigation was conducted using EDA (Exploratory Data Analysis) and chi-test
techniques. The first step in data science is to clean the data, i.e. to check that the data is in
the correct format, remove duplicates, replace NaN values, etc. The given dataset was first
checked to confirm that every column is in an appropriate format; for example, the date
column had a few entries that were not in DATE format. The “rainy” and “infestation”
columns were then checked to confirm that they contain only the integer values 0 or 1. The
“field” column was checked for blank entries and for any value other than the three
plantation names.
The second step was to explore the data for trends or patterns. First, an observation was made
about nominal data: all three columns FIELD, RAINY and INFESTATION contain nominal
data. The deterministic rule in the scenario was framed only for the Rondadoo plantation.
Because these columns are nominal, only the chi test could be used to find out whether there
is any relationship among them. The chi-test p-value was therefore calculated first, and the
null hypothesis was rejected, which at the 5% significance level corresponds to a p-value
below 0.05. Next, the relationship with the TEMPERATURE value had to be found, as per the
deterministic rule. Since the temperature column does not contain nominal data, the chi test
could not be conducted for it. Hence, the correlation method was used to find the relationship
between temperature and infestation. The value came out to be +0.7, which indicates a strong
positive (uphill) linear relationship between the two. The next step was to work out the level
of infestation in different months of the year, to find which months see high infestation with
or without rainy weather. Along with this, it was crucial to analyse plantation growth in
months with extreme conditions, i.e. very high or very low temperature. Such months are not
favourable for plantation of any type.
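The correlation step described above can be illustrated with a minimal sketch. It computes the
Pearson coefficient between a numeric column and a 0/1 column. The readings are made-up
illustrative values, not the FluffyGroCo data, and the function name `pearson` is an
assumption introduced for this example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical sample: temperature readings and 0/1 infestation flags.
temperature = [12.0, 14.5, 18.0, 21.0, 23.5, 26.0, 28.5, 30.0]
infestation = [0, 0, 0, 1, 0, 1, 1, 1]

r = pearson(temperature, infestation)
print(f"r = {r:.2f}")
```

A coefficient near +1 indicates that infestation flags tend to be 1 at higher temperatures; a
value near 0 indicates no linear relationship.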
Analysis and results
The investigation started with preparation of the data. First, a check for duplicates was
carried out, but none were found. The date format appears to be the same in all entries, i.e.
dd/mm/yy. However, some dates are in ‘General’ format. These were first processed using the
‘Text to Columns’ option under the ‘Data’ tab of Excel, and the format was then set to
DD-MMM-YY.
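The report performs these preparation steps in Excel. As a rough equivalent, the sketch below
shows the same checks in Python: dropping exact duplicates, parsing dates stored as text, and
validating the 0/1 flags and field names. The sample rows, the second date layout and the
helper names `parse_date` and `clean` are assumptions made for illustration only.

```python
from datetime import datetime

# Hypothetical raw rows: (date text, field, rainy, infestation, temperature).
raw_rows = [
    ("01/01/15", "Rondadoo", "0", "1", "14.2"),
    ("01/01/15", "Rondadoo", "0", "1", "14.2"),   # exact duplicate
    ("02/01/15", "Nextafoo", "1", "0", "16.8"),
    ("2015-01-03", "Uptagoo", "0", "0", "15.1"),  # date stored in another text layout
    ("04/01/15", "Unknown", "0", "2", "15.0"),    # invalid field name and flag
]

VALID_FIELDS = {"Rondadoo", "Nextafoo", "Uptagoo"}
DATE_LAYOUTS = ("%d/%m/%y", "%Y-%m-%d")  # assumed layouts, tried in order

def parse_date(text):
    """Return a date for the first layout that matches, else None."""
    for layout in DATE_LAYOUTS:
        try:
            return datetime.strptime(text.strip(), layout).date()
        except ValueError:
            continue
    return None

def clean(rows):
    """Drop exact duplicates and set aside rows that fail the format checks."""
    seen, cleaned, rejected = set(), [], []
    for row in rows:
        if row in seen:
            continue
        seen.add(row)
        date_text, field, rainy, infestation, temperature = row
        day = parse_date(date_text)
        valid = (
            day is not None
            and field in VALID_FIELDS
            and rainy in ("0", "1")
            and infestation in ("0", "1")
        )
        if not valid:
            rejected.append(row)
            continue
        cleaned.append((day, field, int(rainy), int(infestation), float(temperature)))
    return cleaned, rejected

cleaned, rejected = clean(raw_rows)
for day, field, rainy, infestation, temperature in cleaned:
    # DD-MMM-YY, the display format used in the report
    print(day.strftime("%d-%b-%y"), field, rainy, infestation, temperature)
```

Keeping the rejected rows in a separate list, rather than silently discarding them, makes it
possible to review what the checks filtered out.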
Nominal data analysis
Infestation, field and rainy are nominal data. To study the impact of rainy weather and field
on infestation, a chi test needs to be conducted. To perform it, a null hypothesis and an
alternate hypothesis are framed.
Null hypothesis: Rainy weather and field do not have an impact on the infestation.
Alternate hypothesis: Rainy weather and field do have an impact on the infestation.
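A minimal sketch of a chi-square test of independence is given below for one pair of nominal
columns, field against infestation. The counts in the table are hypothetical and are not taken
from the dataset. The sketch relies on the fact that, for exactly 2 degrees of freedom, the
chi-square p-value has the closed form exp(-x/2), so only the standard library is needed.

```python
import math

def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Hypothetical counts per field: [infested, not infested].
table = [
    [300, 1526],  # Rondadoo
    [420, 1406],  # Nextafoo
    [280, 1546],  # Uptagoo
]

statistic = chi_square_statistic(table)
dof = (len(table) - 1) * (len(table[0]) - 1)  # (3 - 1) * (2 - 1) = 2
# Closed form of the chi-square survival function, valid only for dof == 2.
p_value = math.exp(-statistic / 2)

print(f"chi2 = {statistic:.2f}, dof = {dof}, p = {p_value:.4g}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```

The null hypothesis is rejected when the p-value falls below the chosen significance level
(0.05 here). For tables with other degrees of freedom, a statistics library would be needed to
obtain the p-value.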
Figure 1: Total infestation v/s Field
Figure 2: Chi Test p value
Plot #1 – Total infestation of each field in the different months of respective years:
Figure 3: Total infestation v/s Month
Analysis results – This plot shows the number of fields infested in a particular month of a
particular year. It can help in finding the month in which the chance of infestation is high
and the field that is worst affected. Among the three fields, Nextafoo has the highest
infestation count in every year from 2015 to 2019.
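The monthly totals behind a plot of this kind can be produced with a simple group-and-count,
similar to a pivot table. The sketch below uses a handful of made-up observations, not the
real dataset.

```python
from collections import defaultdict
from datetime import date

# Hypothetical observations: (date, field, infestation flag).
observations = [
    (date(2015, 1, 3), "Nextafoo", 1),
    (date(2015, 1, 9), "Nextafoo", 1),
    (date(2015, 1, 9), "Rondadoo", 0),
    (date(2015, 2, 1), "Uptagoo", 1),
    (date(2016, 1, 5), "Nextafoo", 0),
    (date(2016, 1, 6), "Rondadoo", 1),
]

# Count infested observations per (year, month, field), like a pivot table.
infested = defaultdict(int)
for day, field, flag in observations:
    infested[(day.year, day.month, field)] += flag

for (year, month, field), count in sorted(infested.items()):
    print(year, month, field, count)
```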
Plot #2: Infestation of fields from 2015 to 2019
The following graph shows that Nextafoo has the highest infestation in every year, and it
also has the largest fluctuations. In 2015 it had 7.93% infestation, which increased in 2016
and then decreased in 2017. It followed the same pattern in 2018 and 2019.
Analysis results – It can be concluded that, among all three plantations, Nextafoo becomes
infested most easily, irrespective of temperature and rainy weather.
Figure 4: Infestation percentage v/s Field
Deterministic rule
The following observations have been made using the above-mentioned data:
Rondadoo:
o Total Rondadoo observations = 1826
o Number of observations satisfying the deterministic rule = 1156
o i.e. approximately 63% of total Rondadoo observations satisfy the rule.
Nextafoo:
o Total Nextafoo observations = 1826
o Number of observations satisfying the deterministic rule = 1106
o i.e. approximately 60% of total Nextafoo observations satisfy the rule.
Uptagoo:
o Total Uptagoo observations = 1826
o Number of observations satisfying the deterministic rule = 980
o i.e. approximately 53% of total Uptagoo observations satisfy the rule.
It can be inferred that the deterministic rule can be applied to the other plantations, but it
proves valid for only a little over half of the Nextafoo and Uptagoo observations (about 60%
and 53% respectively).
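The percentages above follow directly from the quoted counts. The short sketch below
recomputes them; the variable names are illustrative only.

```python
# Counts quoted in the text: observations per field and observations
# that satisfy the deterministic rule.
totals = {"Rondadoo": 1826, "Nextafoo": 1826, "Uptagoo": 1826}
satisfied = {"Rondadoo": 1156, "Nextafoo": 1106, "Uptagoo": 980}

shares = {field: 100 * satisfied[field] / totals[field] for field in totals}
for field, share in shares.items():
    print(f"{field}: {share:.1f}% of observations satisfy the rule")
```

To one decimal place the shares are slightly higher than the truncated whole-number figures
quoted above.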
Validating rule
Nextafoo
It is essential to study the pattern in those observations where the rule does not give valid
results. It was observed that, whether or not the weather is rainy, there is no infestation in
May of any year.
The following graph shows the pattern for those observations where the deterministic rule is
TRUE:
Figure 5: Nextafoo observations - Follow Rule
The above data shows that, in the absence of rain, there is around 12.24% average infestation
in January, when the temperature is less than 15. When it rains in September, there is hardly
any infestation.
The following graph shows the pattern for those observations where the deterministic rule is
FALSE:
Figure 6: Nextafoo observations - Rule not followed
The potential month for Nextafoo plantation is September. The above graph shows that even
if it rains in September there is only 4.90% infestation, which is quite low compared to other
months, and in December there is hardly any infestation. However, although very low
temperatures may reduce the chance of infestation, plantation is also not feasible at such
temperatures.
Uptagoo
Following graph shows pattern for those observations where deterministic rule is TRUE:
Figure 7: Uptagoo observations - Rule followed
Following graph shows pattern for those observations where deterministic rule is FALSE:
[Bar chart: infestation percentage (y-axis, 0% to 16%) by month (x-axis, months 1-4 and 7-12), with two series labelled 0 and 1]
Figure 8: Uptagoo observations - Rule not followed
Like Nextafoo, Uptagoo shows months with no infestation in the above graph, where the
deterministic rule gives a FALSE value. Around 47% of Uptagoo observations did not follow
the rule, which is close to the figure for Nextafoo. Hence, the same conclusion can be drawn:
September can be an ideal month for planting Uptagoo.
Rondadoo
Following graph shows pattern for those observations where deterministic rule is TRUE:
Figure 9: Rondadoo observations - Rule followed
Following graph shows pattern for those observations where deterministic rule is FALSE:
Figure 10: Rondadoo observations - Rule not followed
Rondadoo follows the same pattern as Uptagoo and Nextafoo: there is no infestation when the
rule is not followed, and the best month for plantation again turns out to be September.
Potential solution
First, the month of the year in which all three fields can be planted has been identified,
which is September. Second, an organic larvae solution may be employed, because when there
is less rain the extra larvae will not be able to flourish.
Ethical and security considerations
There are many security and privacy concerns when it comes to big data, because it involves
data collection on a massive scale (Li, 2018). The major questions are who can see the data,
whether it is safe in the cloud, and whether it can be misused. The data Fluffygroco handles
does not include consumer data that could pose an identity theft risk, but it does include
production data that could help competitors and other players in the textile industry to
influence prices and to skew the nature of their recommendations. There are also concerns
that this data may leak to competitors, who could use it to learn trade secrets or gain a
competitive advantage from knowing Fluffygroco’s situation. Additionally, there are concerns
about price discrimination for chemicals, seeds, pesticides, fertilizers and farming
equipment, among others, and about the data being used to gain an unfair advantage in
commodity or real-estate markets.
Agribusinesses such as Fluffygroco that already use these types of big data services
increasingly have to reveal their own production and business data in order to gain access to
the benefits of the technology, while in return they know almost nothing about the back-end
systems where their data is kept and processed. Some producers and businesses in the primary
industry are concerned that big data is getting bigger by the day and relies on advanced
networks, with the risks of exposure and exploits that haunt such networks. Concerns about
privacy and security have led such producers to store their data locally rather than in the
cloud or with a third-party data storage provider. Given the security breaches that have
occurred in both private and public domains, it would be unreasonable to believe that data is
completely secure.
Potential solutions
Data analytics is typically about finding the smallest flaws in a given scenario or system and
then correcting them. More often than not, that flaw is the human element. However, in the
case of Fluffygroco the problem seems to lie outside human error. A properly designed data
analytics solution would give answers to the data-oriented problems currently faced by the
company (Korsmo, 2010). Most importantly, it would use historical data to predict and
analyse future trends.
Some of the data sources that the designed system could tap into are:
Fertilizers and pesticides: The fertilizers and pesticides used by Fluffygroco to grow the
trees. These can be broken down into nutrients such as Nitrogen (N) 2%, Potassium oxide
(K2O) 26%, etc.
Soil composition: Soil properties could also be obtained from the field’s geographical
location.
Weather data: Past weather data from the MET department, as well as future forecasts, can
be sourced (State University, 2019).
Remote sensing: Remote sensors monitor the visible condition of the trees, branches and
leaves; this data is converted into spatial information, which is then forwarded to the
Geographical Information System for further mapping.
The data generated and sourced from the processes above is extracted and then fitted into a
domain data model designed for further analysis. This process is typically known as ETL
(Extract, Transform and Load). The data model encapsulates different entities such as trees,
pesticides, weather patterns and shedding, and collectively describes the farming activities
of Fluffygroco.
Technology Stack - Processing and Storage
HPCC
HPCC, or high-performance computing cluster, is the data-intensive computing platform
chosen for this activity. HPCC would be used to extract, process and analyse the data in
order to develop the data model. HPCC has its own programming language, the Enterprise
Control Language (ECL), which is designed specifically for data processing
(Hpccsystems.com, 2019).
Querying and Searching
Searching and querying are typically driven by emerging use cases, since customers who are
new to HPCC are only interested in the data currently stored in HPCC (Stahlbock, Weiss and
Abou-Nasr, 2019). Prospective customers, such as those involved in supply, wholesale,
trading, agricultural and production businesses like Fluffygroco, have their own search
needs, which need to be analysed on a case-by-case basis. Broadly, two strategies are
commonly implemented:
Elasticsearch - Elasticsearch is the main method for browsing, slicing and dicing the data.
Data subsets are gradually transferred from HPCC into the Elasticsearch module. These
subsets can vary in structure and content, depending on the search needs of each case.
Speed, scalability and stability, together with the Elastic Stack (Logstash, X-Pack, Kibana,
etc.), could make Elasticsearch a good fit for Fluffygroco, as it would enable powerful
search while also facilitating data analytics tasks.
Rapid Online XML Inquiry Engine (ROXIE) - ROXIE is HPCC’s own data delivery engine. It is
meant to expose punctual and structured results (Li, 2015). The drawback of ROXIE is that
designing and implementing such a system can be a lengthy, time-consuming process; once in
place, however, it returns results quickly (Pierson and Porway, 2017).
The importance of data management in businesses engaged in primary activities, such as
producing trees for the textile business, is established. Many businesses and farmers provide
their farm data to service providers, and all this data forms part of the main database.
Ultimately, the larger goal of collecting such data is the hope that one day these databases
will be open source and become part of a shared body of big data for open analysis and
management (Garber, 2019). Fluffygroco must realize the importance of the data collected
and of using data science tools to scour this data, uncover hidden patterns and ultimately
help with its Truffula situation. Data science and big data have a growing set of applications
across a range of sectors, and their use in farm production and agribusiness is set to
increase gradually (Fairfield and Shtein, 2014). The system described above would help
Fluffygroco identify and analyse the persistent infestation that has been hampering its
business (Provost and Fawcett, 2013).
References
Hpccsystems.com. (2019). ECL Basics | HPCC Systems. [online] Available at: https://hpccsystems.com/training/documentation/ecl-language-reference/html/ECL_Basics.html [Accessed 30 Aug. 2019].
Pierson, L. and Porway, J. (2017). Data science. 1st ed. Hoboken, NJ: John Wiley and Sons, Inc., pp.34-36.
Provost, F. and Fawcett, T. (2013). Data science for business. 1st ed. Sebastopol (CA): O'Reilly, pp.21-23.
Stahlbock, R., Weiss, G. and Abou-Nasr, M. (2019). Data Science. 2nd ed. Bloomfield: C.S.R.E.A., pp.11-13.
State University, I. (2019). What is an HPC cluster | High Performance Computing. [online] Hpc.iastate.edu. Available at: https://www.hpc.iastate.edu/guides/introduction-to-hpc-clusters/what-is-an-hpc-cluster [Accessed 30 Aug. 2019].
Fairfield, J. and Shtein, H. (2014). Big Data, Big Problems: Emerging Issues in the Ethics of Data Science and Journalism. Journal of Mass Media Ethics, 29(1), pp.38-51.
Garber, A. (2019). Data Science: What the Educated Citizen Needs to Know. Issue 1.
Korsmo, F. (2010). The Origins and Principles of the World Data Center System. Data Science Journal, 8, pp.IGY55-IGY65.
Li, J. (2015). Big Research Data and Data Science. Data Science Journal, 14.
Li, J. (2018). Advancing science and technology with big data analytics. Statistical Analysis and Data Mining: The ASA Data Science Journal, 11(3), pp.97-97.
Appendix
Statistics and methodology
The investigation began with data cleaning, which is a basic principle of data science.
Initially it seemed that the dates were in the appropriate dd/mm/yy format, but when a
formula was applied to extract the month and year for further analysis, it was found that
most of the date data was in text format. It was converted into the dd/mm/yy date format. The
main challenge in exploring the data was then to apply exploratory data analysis techniques
such as the boxplot. The data had mainly nominal variables, with which a boxplot could not
be used. A boxplot of field against temperature was tried, but it did not help in analysing the
impact of either variable on infestation. Another challenge was to find out the distinctness in
the data.
The first step in analysing any data is to understand it, so that it can be interpreted correctly
and an approach can be identified for using the data to achieve results. The objective of the
research was to understand the impact of rain and temperature on the possibility of
infestation in the three fields for which data was collected. While the data looked simple at
first, challenges arose when deciding how to use it to arrive at a conclusion on infestation.
In a primary research investigation, the accuracy of data is very important, as it determines
the efficiency and efficacy of the results of analysis. When a large amount of data is
collected, there can be instances of erroneous data entry, inaccurate representation or
missing data. It is therefore important to clean the data to eliminate these discrepancies
before analysis, which makes data cleaning the first step. Initially, it was thought that the
dates were in the appropriate dd/mm/yy format, but when a formula was applied to extract the
month and year for further analysis, it was found that most of the date data was in text
format, which rendered the analysis ineffective. Thus, while cleaning the data, the date fields
were converted into the dd/mm/yy format.
When performing exploratory data analysis, a major challenge arose because the researcher
first thought of using a box plot to represent the data visually, which proved difficult. The
data had mainly nominal variables, with which a boxplot could not be used. A boxplot was
created to represent the relationship between the two variables field and temperature.
However, it was not effective enough to help in analysing the impact of either variable on
infestation. Thus, the idea of the box plot was dropped, and other regular graphs, including
bar charts, were used for data visualization.
Another major challenge was to find out the distinctness in the data, which made it difficult
to decide on an appropriate test for the next step of the analysis. At one point correlation
and regression looked appropriate, and at other times ANOVA looked more appropriate. It was
difficult to decide which test fitted the research question. The researcher then considered
the Chi-square test, so that the relationship between two variables could be determined.
However, Chi-square could only be used for nominal variables, while temperature was not
categorical data. Thus, a correlation test was also used.
The challenges were overcome with time invested in exploration and trial. Further,
hypotheses were determined based on the objective of the research, which was to determine
whether a specific combination of rain and temperature makes the fields vulnerable to
infestation. A null hypothesis and an alternative hypothesis were formed covering the
temperature values that were to be explored for their relation to the possibility of
infestation in the fields. Another hypothesis was formed to test the relationship between rain
and infestation. However, since the analysis needed a very specific combination of the two
variables, these hypotheses were difficult to use to prove the point. Thus, the researcher
decided to go ahead with the deterministic rule formula in Excel and perform only the EDA.
This was important because the two variables were found to have an impact on infestation
only in certain situations, where a particular range of temperature existed at the time of
rain. Another major challenge was working out the formula that could be used to validate the
deterministic rule. Additional functions, including AND and OR, had to be nested within the IF
formula in Excel.
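The nested IF/AND approach can be sketched as follows. The actual deterministic rule and its
threshold values are not reproduced in this report, so the temperature range, the rule shape
(rain combined with a temperature range) and the sample observations below are all
hypothetical placeholders chosen only to show the mechanics of validating a rule against
recorded outcomes.

```python
# Hypothetical thresholds: the actual rule values are not given in the text.
TEMP_LOW = 15.0
TEMP_HIGH = 25.0

def rule_predicts_infestation(rainy, temperature):
    """Python counterpart of a nested spreadsheet formula of the form
    IF(AND(rainy = 1, temperature >= low, temperature <= high), 1, 0)."""
    return 1 if rainy == 1 and TEMP_LOW <= temperature <= TEMP_HIGH else 0

def rule_holds(rainy, temperature, infestation):
    """True when the rule's prediction matches the recorded infestation."""
    return rule_predicts_infestation(rainy, temperature) == infestation

# Hypothetical observations: (rainy, temperature, infestation).
observations = [
    (1, 18.0, 1),  # rule predicts 1, observed 1: rule followed
    (1, 30.0, 0),  # rule predicts 0, observed 0: rule followed
    (0, 20.0, 1),  # rule predicts 0, observed 1: rule not followed
    (1, 22.0, 0),  # rule predicts 1, observed 0: rule not followed
]

matches = sum(rule_holds(*row) for row in observations)
print(f"{matches} of {len(observations)} observations follow the rule")
```

Counting the matches per field in this way gives the kind of "observations satisfying the
rule" totals reported in the analysis.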