Analysis, in simple terms, means examining or studying something. So data analysis simply means examining or studying the data that is available. The analysis may be done using various statistical methods, for example finding the central tendency of the given data. Central tendency in this case means the mean, median and mode. Together with related summary measures such as the standard deviation, these simple calculations describe the average, the central point and the spread of the data.
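As a minimal sketch of these summary measures in R: the scores below are made-up values, and stat_mode() is a helper written only for this illustration, since base R has no built-in function for the statistical mode.

```r
# Hypothetical sample of exam scores (made-up values for illustration).
scores <- c(72, 85, 85, 90, 64, 85, 78, 90)

# Base R has no built-in statistical mode, so stat_mode() is a small
# helper defined for this sketch (it is not part of R itself).
stat_mode <- function(v) {
  counts <- table(v)                            # frequency of each distinct value
  as.numeric(names(counts)[which.max(counts)])  # most frequent value (smallest one if tied)
}

mean_score   <- mean(scores)       # arithmetic mean
median_score <- median(scores)     # middle value of the sorted data
mode_score   <- stat_mode(scores)  # most frequent value
sd_score     <- sd(scores)         # sample standard deviation (a measure of spread)

print(c(mean = mean_score, median = median_score,
        mode = mode_score, sd = sd_score))
```

Note that the standard deviation measures spread rather than central tendency; it is included here because it is usually reported alongside the mean.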
For complex tasks such as classification, clustering, association, prediction and other data mining tasks, such simple statistical methods are not sufficient. When the data set is large, simple summary measures alone do not give accurate and reliable answers. To overcome this, data mining tasks are implemented using statistical languages and tools such as R, Weka, SAS, SPSS, Python and many more.
These languages and tools come with sets of pre-defined functions, methods and packages that can be used to perform complex data mining tasks. Several of them, such as R, Weka and Python, are open source. They are easy to use, reliable and efficient even when the data set is large.
Of the languages and tools mentioned above, two are mainly used for data analysis: R and Python. In this article, our main focus will be on the R tool, i.e. R programming.
R tool (R Programming):
R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and macOS. It is an important tool for development in the numerical analysis and machine learning spaces. It was developed at the University of Auckland, New Zealand, by Ross Ihaka and Robert Gentleman.
Some important statistical features present in the R language are as follows:
- R offers libraries and packages for all the major statistical techniques, which makes it suitable for many kinds of analysis on many types of data.
- R was originally derived from the S language, so its syntax is very similar to that of S, and S users can switch over to R easily.
- R provides various exploratory data analysis methods, which help you summarise and visualise your data so that it is easier to understand and analyse.
- It supports techniques such as linear and non-linear modelling, classical statistical tests, time-series analysis, classification and clustering. (Clustering means forming groups of data points based on similar patterns.)
- R can be easily extended through user-defined functions and add-on packages.
- It runs on almost any standard computing platform or operating system.
- The functionality of R is divided into a number of modular packages.
- For computationally intensive tasks, code written in compiled languages such as C, C++ and Fortran can be linked with R and called at run time. Advanced users can also write C code to manipulate R objects directly.
- The major advantage of using R is that it is absolutely free! R is open source.
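To give a feel for the modelling and clustering features listed above, here is a minimal sketch using only functions and example data sets that ship with a standard R installation (lm, kmeans, mtcars, iris). The choice of variables and the number of clusters are assumptions made purely for illustration.

```r
# mtcars and iris are example data sets bundled with R.

# Linear modelling: regress fuel economy (mpg) on car weight (wt).
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))  # intercept and slope of the fitted line

# Clustering: group the iris flowers into 3 clusters using the
# four numeric measurements (columns 1 to 4).
set.seed(42)  # k-means starts from random centres; fix the seed for repeatability
clusters <- kmeans(iris[, 1:4], centers = 3)

# Cross-tabulate cluster labels against the known species.
print(table(clusters$cluster, iris$Species))
```

The cross-tabulation at the end is one simple way to see how well the discovered groups line up with the species labels, which k-means itself never looks at.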
Everything that has advantages also has some disadvantages. So let us now look at the drawbacks of using R as a statistical language.
- It is based on fairly old technology (the S language it derives from dates back to the 1970s).
- It has limited built-in support for dynamic or 3-dimensional graphics. (Work is under way to overcome this issue.)
- Its functionality is driven by user demand: a method is available only if someone has contributed an implementation.
- Objects are generally stored in physical memory, which limits the size of the data that can be handled. (Improvements are ongoing for this issue as well.)
- It is not ideal for every possible situation.
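The point about physical memory can be seen directly. The following is a small sketch, assuming a default R session in which the utils package is attached; the vector length is an arbitrary choice for illustration.

```r
# A vector of one million doubles is held entirely in RAM.
x <- numeric(1e6)

# object.size() (from the utils package) estimates the memory used by an object.
size_bytes <- as.numeric(object.size(x))
print(object.size(x), units = "Mb")

rm(x)            # drop the reference to the vector
invisible(gc())  # ask R to run garbage collection and release the memory
```

Because every object lives in memory like this, a data set larger than the available RAM cannot simply be loaded in one piece.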
A few basic examples and syntax details of R are as follows:
- In R, the symbol <- is the assignment operator; it assigns a value to a variable.
- The “#” symbol starts a comment in R.
- Data can be supplied as vectors, lists and various other structures.
- Files of various types, such as CSV, JSON, XML, XLSX and HDF5, can be read using the appropriate packages and functions.
- Types of data:
- x <- c("a", "b", "c", "d") ## Character data
- x <- c(1, 2, 3, 4) ## Numeric data
- x <- c(1L, 2L, 3L, 4L) ## Integer data
- x <- 1+2i ## Complex data
- x <- c(TRUE, TRUE, FALSE) ## Logical data
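The syntax points above can be tried out with a short runnable sketch. The variable names and the inline CSV text are made up for illustration; in practice read.csv() would usually be given a file path rather than the text argument.

```r
# "<-" assigns a value to a name; "#" starts a comment.
# Vectors are built with c(), and logical values are written TRUE / FALSE.
chr <- c("a", "b", "c", "d")  # character vector
num <- c(1, 2, 3, 4)          # numeric (double) vector
int <- c(1L, 2L, 3L, 4L)      # integer vector (note the L suffix)
cpx <- 1+2i                   # complex number
lgl <- c(TRUE, TRUE, FALSE)   # logical vector

# class() reports the type of each object.
print(sapply(list(chr, num, int, cpx, lgl), class))

# A list can hold elements of different types.
record <- list(name = "sample", values = num, valid = TRUE)
str(record)

# Reading CSV data: the text argument supplies made-up inline data
# so that the sketch does not depend on any file on disk.
df <- read.csv(text = "id,score\n1,72\n2,85\n3,90")
print(df)
```

Formats such as JSON, XML, XLSX and HDF5 need add-on packages, which is why only the CSV case, handled by base R, is shown here.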
R has many other syntactic and semantic features. To learn more, you can refer to any of the R tutorials available online.
Despite some drawbacks, R is widely used for data mining and data analysis. Python has its own advantages and drawbacks when compared with R, but for now it’s only about R. I will be posting on “Python as statistical language” soon.
Figure 1: KDnuggets Poll
-Article submitted and written by: Akshay Rakesh Toshniwal