Data mining as a field of academic research is growing rapidly. Though its theoretical underpinnings have been around for a while, its practical use has quickly reached a level where researchers and businesses are finding great value in it. The information explosion driven by the success of the World Wide Web, the rising purchasing power of ordinary consumers, a growing population, and the globalization of commerce are only a few of the reasons the field has gained such popularity in such a short time.
Initially there was a lot of confusion about the difference between the fields of Inferential Statistics and Data Mining, and researchers around the world produced a great deal of literature to address it. Though there is no single universally accepted definition for either of them, many researchers have defined these fields from the perspective of their own applications. From what I’ve gathered about these fields, this is how I draw the line between them.
Difference between Data Mining & Inferential Statistics:
- Data Mining is a field where you discover hidden patterns in large, already existing data sets. The discovered patterns are later used for analysis and decision making in the area of concern. The process of discovering and using these patterns is also called ‘Knowledge Discovery’. Inferential Statistics is the field where you support or reject a preconceived hypothesis (tested against a null hypothesis) by applying classical statistical methods to a sample drawn from a given population.
- Data mining starts from an already existing database (usually a large data set), whereas Inferential Statistics builds its own working data set by applying sampling methods to a population.
- Data mining methods such as classification and clustering scan the entire data set in search of hidden patterns, while classical statistical methods are run over only a small portion of it (the sample).
The above point also implies that Data mining methods are more computationally intensive, since they have to run through large data sets, and hence should be used only when really needed.
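To make the "scan everything, no hypothesis up front" idea concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the `transactions` data is made up, and `frequent_pairs` and `min_support` are hypothetical names I chose, not part of any particular data mining library. The function walks through every record and reports item pairs that show up together often, which is one simple kind of hidden pattern.

```python
from collections import Counter
from itertools import combinations

# Made-up transaction log (assumption for illustration); in practice
# this would be a large, already existing database.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
    {"milk", "bread"},
]

def frequent_pairs(rows, min_support):
    """Scan every row and return the item pairs that co-occur
    in at least `min_support` rows."""
    counts = Counter()
    for row in rows:  # the whole data set is visited, not a sample
        for pair in combinations(sorted(row), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

patterns = frequent_pairs(transactions, min_support=3)
print(patterns)  # → {('bread', 'butter'): 3}
```

Notice that nothing in the code says "check whether bread and butter go together". The pair surfaces on its own, and the cost grows with both the number of rows and the number of items per row, which is why this style of analysis gets expensive on large data sets.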
So when do you use data mining over inferential statistics? Well, the answer is simple. If you don’t know what you are looking for but want to make the best sense of the data you have, use data mining. If you know what you are looking for and want to back it up with evidence from the data you have, use inferential statistics.
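For the "you know what you are looking for" case, a rough sketch of the inferential route might look like this. Everything here is an assumption for illustration: the `population` is synthetic, the hypothesized mean of 50 is arbitrary, and `one_sample_z_test` is a hypothetical helper that uses a normal approximation (reasonable only for fairly large samples) rather than a full t-test.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic population of 100,000 order values (made-up data).
population = [random.gauss(52.0, 10.0) for _ in range(100_000)]

def one_sample_z_test(sample, hypothesized_mean):
    """Two-sided test of H0: population mean == hypothesized_mean,
    using a normal approximation to the sampling distribution."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    z = (statistics.mean(sample) - hypothesized_mean) / standard_error
    p_value = 2 * (1 - statistics.NormalDist().cdf(abs(z)))
    return z, p_value

# Only a small sample is examined, not the whole population.
sample = random.sample(population, 400)
z, p_value = one_sample_z_test(sample, hypothesized_mean=50.0)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Here the question is fixed before any data is touched, and only 400 of the 100,000 values are ever looked at. A small p-value would count as evidence against the hypothesized mean; a large one would mean the sample gives no reason to reject it.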