Pictograms or 3D plots? Neither – Just stick to 2D:

Posted by Koti | Posted in Uncategorized | Posted on 30-12-2009

In general, a visual representation of data is often the most efficient way of putting your point across. A picture is worth a thousand words, and this holds not only in the art world but also in the business and academic worlds. Graphs of all sorts (stem, box, scatter, histogram, bar, pie and 3D) are used as windows for looking deep into an otherwise shallow-looking and uninformative pool of data points. Graphs have the power to harness the might of the human neural network to discover new dimensions in the data that would otherwise remain hidden. This is exactly why one should be careful when choosing the type of graph in order to paint the ‘best picture’.

Many online and offline magazines use pictograms and 3D plots to present data to the public. Though they may look very appealing, pictograms and 3D plots can easily mislead us when we interpret the data. A couple of examples are shown below.

example of a bad usage of a pictogram

The above example is a pictogram that aims to show the distribution of advertising budgets across TIME, Newsweek and US News. What is clear is that TIME attracts the largest amount of advertising spending, followed by Newsweek and then US News. The figures quoted in the picture put the ratio of TIME’s share to Newsweek’s at around 1.6 and that of TIME to US News at around 2.9. This means that TIME’s figure should look only about 1.6 times larger than Newsweek’s and only about 2.9 times larger than that of US News. But in the pictogram, TIME looks much larger than these ratios suggest. Where does this visual discrepancy come from?

Well, what actually happened is that the same picture was magnified and reused for the three data points. While magnifying the picture, both the height and the width were increased proportionally to avoid distortion. This means that the drawn area of TIME’s figure is about 2.56 times that of Newsweek’s (1.6*1.6, accounting for both height and width). Similarly, the drawn area of TIME’s figure is about 8.4 times that of the US News figure (2.9*2.9) instead of 2.9.

So our eyes, which capture the area of the pens rather than only their height, are misled into thinking that TIME’s share of budgets is much further ahead of its competitors than it actually is.
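The area arithmetic can be checked with a few lines of Python. This is only a minimal sketch: the function name `perceived_area_ratio` is made up for illustration, and it assumes the icon is scaled by the data ratio in both height and width, as described for this pictogram.

```python
def perceived_area_ratio(value_ratio: float) -> float:
    """Area ratio when an icon's height and width are both scaled by value_ratio."""
    return value_ratio ** 2


# Approximate ratios quoted in the post (assumed, not re-measured from the picture).
for label, ratio in [("TIME / Newsweek", 1.6), ("TIME / US News", 2.9)]:
    area = perceived_area_ratio(ratio)
    print(f"{label}: data ratio {ratio}, drawn area ratio {area:.2f}")
```

Run as is, this prints drawn area ratios of 2.56 and 8.41, which is how much bigger the TIME pen looks compared with the 1.6 and 2.9 that the data supports.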

The following is another example, this time of an unnecessary use of a 3D plot to present data.

The aim of this graph is to showcase the drastic increase in the number of Starbucks outlets around the world. If you look at the graph carefully, between 1999 and 2003 there is an increase from about 2000 stores to a little more than 6000. This is effectively an increase of about 200%. But from the graph below, the increase appears much bigger than the actual numbers suggest.

Graph showing increase in the number of outlets over the years

Using a 3D graph brings with it another problem: the third dimension. When a 3D picture is magnified, its height, width and depth are all increased simultaneously to avoid distortion, so the drawn volume grows with the cube of the scale factor. This is the reason the picture looks much bigger than the 200% increase suggested by the numbers.
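To put rough numbers on this, here is a small Python sketch. The helper names `percent_increase` and `apparent_size_ratio` are hypothetical, the store counts are the approximate values read off the graph, and the sketch assumes the 3D object is scaled uniformly by the data ratio in every dimension.

```python
def percent_increase(old: float, new: float) -> float:
    """Relative growth from old to new, expressed in percent."""
    return (new - old) / old * 100


def apparent_size_ratio(value_ratio: float, dimensions: int) -> float:
    """Size ratio of a shape scaled by value_ratio along `dimensions` axes."""
    return value_ratio ** dimensions


# Approximate store counts quoted in the post (assumed values).
stores_1999, stores_2003 = 2000, 6000
ratio = stores_2003 / stores_1999

print(percent_increase(stores_1999, stores_2003))  # 200.0
print(apparent_size_ratio(ratio, 1))  # 3.0  (a plain 2D bar: height only)
print(apparent_size_ratio(ratio, 3))  # 27.0 (a 3D object scaled on all axes)
```

Under that assumption, a threefold increase in stores is drawn as an object with 27 times the volume, whereas a plain bar whose height alone changes would show the 3:1 ratio directly.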

So, essentially, stick to traditional 2D plots for the most accurate representation of your data.

Data Mining & its cousin – Inferential Statistics

Posted by Koti | Posted in Business, Technology | Posted on 23-12-2009

Data mining as a field of academic research is growing rapidly. Though the theoretical underpinnings of the field have been around for a while, its practical use has quickly reached a level where researchers and businesses are finding great value in it. The information explosion due to the success of the World Wide Web, the general purchasing power of the common man, a growing population and the globalization of commerce are only a few of the reasons for this field gaining such popularity in such a short time.

Initially there was a lot of confusion about the difference between the fields of Inferential Statistics and Data Mining, and researchers around the world produced a large body of literature to address it. Though there is no single universally accepted definition for either of them, many researchers have defined these fields from the perspective of their own applications. From what I’ve gathered and know about these fields, this is how I draw the line between them.

Difference between Data mining & Inferential Statistics:

  • Data Mining is a field where you discover hidden patterns in already existing large data sets. The discovered patterns are later used for analysis and decision making in the area of concern. The process of using the discovered hidden patterns is also called ‘Knowledge Discovery’. Inferential Statistics, by contrast, is the field where you try to support or reject a pre-conceived hypothesis (tested against a null hypothesis) by applying classical statistical methods to a sample drawn from a population.
  • Data mining starts from an already existing database (usually a large data set), while Inferential Statistics generates its own data by applying sampling methods to a data set.
  • Data mining methods such as classification and clustering scan the entire dataset in search of hidden patterns, while classical statistical methods are run over only a small section of the dataset (the sample).

The last point also implies that Data mining methods are more computationally intensive, as they have to run through large data sets, and hence should be used only when really needed.

So when do you use data mining over inferential statistics? Well, the answer is simple. If you don’t know what you are looking for but want to make the best sense of the data you have, use data mining. If you know what you are looking for and want to back it up with evidence from the data you have, use inferential statistics.
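As a rough illustration of this contrast, here is a minimal Python sketch using only the standard library. Everything in it is an assumption made for illustration: the toy transactions, the function names `frequent_pairs` and `one_sample_t`, and the choice of simple pair counting as a stand-in for a data mining method and a one-sample t statistic as a stand-in for a classical test.

```python
import random
import statistics
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each transaction is the set of items in one purchase.
transactions = [
    {"bread", "butter"}, {"bread", "butter", "milk"}, {"milk", "tea"},
    {"bread", "butter", "tea"}, {"milk"}, {"bread", "butter"},
]


def frequent_pairs(data, min_count):
    """Data-mining style: scan every record, with no hypothesis in mind,
    and report the item pairs that occur together at least min_count times."""
    counts = Counter()
    for basket in data:
        counts.update(combinations(sorted(basket), 2))
    return {pair: n for pair, n in counts.items() if n >= min_count}


def one_sample_t(sample, hypothesized_mean):
    """Inferential style: t statistic for the null hypothesis that the
    population mean equals hypothesized_mean, computed from a sample only."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / n ** 0.5
    return (statistics.mean(sample) - hypothesized_mean) / standard_error


# No prior question: let the full scan surface a pattern.
print(frequent_pairs(transactions, min_count=3))  # {('bread', 'butter'): 4}

# Prior question: "is the average basket size 2?" Check it on a small sample.
basket_sizes = [len(basket) for basket in transactions]
sample = random.Random(0).sample(basket_sizes, 4)
print(one_sample_t(sample, hypothesized_mean=2))
```

The first call touches every transaction and turns up a pattern nobody asked about, while the second looks only at four sampled values to check a question fixed in advance. A real analysis would compare the t statistic against a critical value or compute a p-value, which is left out here.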