Data Preparation and Statistics in Data Science
“As a Data Scientist, 80% of your time is dedicated to data preparation or data management, and the rest is for the actual analysis and insights.” That is what I have heard ever since I decided to pursue Data Science. The split matters because you have to know your data well in order to provide meaningful analysis.
For starters, data preparation is the process of cleaning raw data so that it is of the best possible quality before analysis.
Examples of poor data quality include (but are not limited to):
- Having rows with null values,
- Having data points written in different formats,
- Having misspelled values, and
- Having outliers.
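As a rough illustration, the four problems above could be handled along these lines. This is only a minimal sketch using the Python standard library: the records, the `SPELLING_FIXES` table, the accepted date formats, and the 1.5 × IQR outlier rule are all assumptions made for the example, not a prescribed cleaning recipe.

```python
from datetime import datetime
from statistics import quantiles

# Hypothetical raw records, invented to show each problem in the list above.
raw_rows = [
    {"city": "Manila", "date": "2021-03-01", "sales": 120.0},
    {"city": "manila ", "date": "03/02/2021", "sales": 135.0},  # different formats
    {"city": "Manilla", "date": "2021-03-03", "sales": None},   # typo and null
    {"city": "Cebu", "date": "2021-03-04", "sales": 110.0},
    {"city": "Cebu", "date": "2021-03-05", "sales": 128.0},
    {"city": "Cebu", "date": "2021-03-06", "sales": 9000.0},    # outlier
]

# Assumed lookup of known misspellings and assumed accepted date formats.
SPELLING_FIXES = {"manilla": "manila"}
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Try each accepted format until one matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def clean(rows):
    """Drop rows with nulls, then standardize spelling and date formats."""
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue
        city = row["city"].strip().lower()
        city = SPELLING_FIXES.get(city, city)
        cleaned.append(
            {"city": city.title(), "date": parse_date(row["date"]), "sales": row["sales"]}
        )
    return cleaned


def drop_outliers(rows, key="sales", k=1.5):
    """Keep rows whose value lies within k * IQR of the quartiles."""
    values = [row[key] for row in rows]
    # "inclusive" treats the values as the full data set, which keeps the
    # extreme value from stretching the quartiles in this tiny example.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [row for row in rows if low <= row[key] <= high]


cleaned_rows = drop_outliers(clean(raw_rows))
for row in cleaned_rows:
    print(row)
```

Whether a null row or an outlier should be dropped, corrected, or kept depends on the data and the question being asked, so treat each rule here as a choice to revisit rather than a default.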
These inconsistencies may result in inaccurate analysis and wrong business decisions. Aside from that, good-quality data makes analysis more efficient because it is easier to understand. This is where Statistics comes in to make analysis even more efficient.
3 Main Reasons Statistics Is Important in Data Science:
- Because of Statistics, you do not have to collect data on a whole population in order to provide the insights you need. You can compute the sample size needed to represent the whole population, which is not only more efficient but can also save you a lot of money.
- In Statistics, there are several ways to test your hypotheses, which lets you support your claims quantitatively. Without quantitative support, claims are merely… claims.
- Lastly, aside from knowing how to code, you also have to understand your models and the information they provide, which is something you learn in Statistics.
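To make the first two points a little more concrete, here is a small sketch using only the Python standard library. It assumes Cochran's sample size formula with a finite population correction for the sampling point, and a two-sided one-proportion z-test for the hypothesis testing point. Both are just one common choice among many, and the population size, margin of error, and conversion counts are made-up numbers for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def required_sample_size(population, margin_of_error=0.05, confidence=0.95, proportion=0.5):
    """Cochran's formula with a finite population correction (assumed method)."""
    # z-score for a two-sided confidence level, e.g. about 1.96 for 95%.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = z**2 * proportion * (1 - proportion) / margin_of_error**2
    # Shrink it for a finite population.
    n = n0 / (1 + (n0 - 1) / population)
    return ceil(n)


def one_proportion_z_test(successes, trials, null_proportion):
    """Two-sided z-test of an observed proportion against a hypothesized one."""
    observed = successes / trials
    standard_error = sqrt(null_proportion * (1 - null_proportion) / trials)
    z = (observed - null_proportion) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical scenario: a population of 10,000 customers.
sample_size = required_sample_size(population=10_000)
print(f"Customers to survey: {sample_size}")

# Hypothetical claim: the conversion rate differs from 10%.
# Made-up observation: 52 conversions out of 400 sampled customers.
z_score, p_value = one_proportion_z_test(successes=52, trials=400, null_proportion=0.10)
print(f"z = {z_score:.2f}, p-value = {p_value:.4f}")
```

Under these assumptions, a few hundred responses stand in for the full 10,000, and the p-value gives the claim a number to be judged against instead of leaving it as a bare assertion.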
This might be a bit discouraging for people from other majors who dream of pursuing Data Science, but Statistics is just one part of it. Even FTW Foundation had 29% of its Data Science alumni come from non-STEM majors. This just means that as long as you are willing to learn, nothing is impossible.