Monday, 18 August 2014

Types of data science questions

Descriptive Analysis is the discipline of quantitatively describing the main features of a collection of information; it aims to summarize a sample. This generally means that descriptive statistics, unlike inferential statistics, are not developed based on probability theory. Even when a data analysis draws its main conclusions using inferential statistics, descriptive statistics are generally also presented.

Goal: describe a set of data (e.g. census data)

  • The first kind of data analysis performed
  • Usually applied to census data
  • Description and interpretation are different steps
  • Descriptions usually cannot be generalized without statistical modelling
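
As a small illustration of what "describing a set of data" can mean in practice, here is a minimal sketch using Python's standard `statistics` module. The `ages` values and the `describe` helper are made up for this example; they are not census data.

```python
import statistics

# Hypothetical sample: ages of ten respondents (made-up values).
ages = [23, 35, 41, 29, 52, 35, 67, 18, 35, 44]

def describe(values):
    """Return a few descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

summary = describe(ages)
for name, value in summary.items():
    print(f"{name}: {value}")
```

These numbers only summarize the ten values at hand; saying anything about a wider population would need a statistical model.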

Exploratory Data Analysis is an approach to analysing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modelling or hypothesis testing task.

Goal: find relationships you didn't know about 

  • Good for discovering new connections
  • Also useful for defining new analyses
  • Exploratory analyses are usually not the final say
  • Exploratory analyses alone should not be used for generalizing or predicting
  • Correlation does not imply causation
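
A typical exploratory step is to check how strongly two variables move together. The sketch below computes a Pearson correlation by hand; the two data series and their names are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up measurements for illustration only.
ice_cream_sales = [20, 25, 32, 41, 50, 58]
sunburn_cases = [3, 4, 6, 7, 10, 11]

r = pearson(ice_cream_sales, sunburn_cases)
print(f"correlation = {r:.2f}")
```

A high correlation here would not mean that ice cream causes sunburn; a third variable such as sunny weather could drive both.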

Inferential Analysis is the process of drawing conclusions from data that are subject to random variation, for example, observational errors or sampling variation. Inferential statistics are used to test hypotheses and make estimations using sample data. Inferential statistics infer predictions about a larger population that the sample represents. The outcome of statistical inference may be an answer to the question "what should be done next?".

Goal: use a relatively small sample of data to say something about a bigger population 

  • Inference is commonly the goal of statistical models
  • Inference involves estimating both the quantity you care about and your uncertainty about your estimate
  • Inference depends heavily on both the population and the sampling scheme
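
To make "estimating both the quantity and your uncertainty" concrete, here is a minimal sketch that estimates a population mean from a sample and attaches an approximate 95% interval. It assumes a simple random sample and a normal approximation (z = 1.96); for a sample this small a t-based interval would be more appropriate. The `heights` values are made up.

```python
import math
import statistics

def mean_with_interval(sample, z=1.96):
    """Estimate the population mean with an approximate 95% interval.

    Assumes a simple random sample and a normal approximation.
    """
    n = len(sample)
    estimate = statistics.mean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(n)
    return estimate, estimate - z * std_error, estimate + z * std_error

# Made-up sample of heights in cm.
heights = [172, 165, 180, 158, 175, 169, 183, 161, 177, 170]

estimate, low, high = mean_with_interval(heights)
print(f"estimate = {estimate:.1f}, interval = ({low:.1f}, {high:.1f})")
```

The interval is only meaningful if the sampling scheme really gives every member of the population a fair chance of being selected.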

Predictive Analysis is an area of data mining that deals with extracting information from data and using it to predict trends and behaviour patterns. Often the unknown event of interest is in the future, but predictive analytics can be applied to any type of unknown whether it be in the past, present or future.

Goal: to use data on some objects to predict values for another object

  • If X predicts Y it does not mean that X causes Y
  • Accurate predictions depend heavily on measuring the right variables
  • Although there are better and worse prediction models, more data and a simple model often work really well
  • Prediction is very hard, especially about the future
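
As a sketch of "a simple model", the example below fits a straight line by least squares and uses it to predict a value for a new object. The helper names `fit_line` and `predict` and the training data are invented for this illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def predict(model, x):
    """Predict y for a new x using a fitted (slope, intercept) pair."""
    slope, intercept = model
    return slope * x + intercept

# Made-up training data: floor area (m^2) and price (thousands).
areas = [50, 60, 80, 100, 120]
prices = [150, 175, 230, 290, 340]

model = fit_line(areas, prices)
print(f"predicted price for 90 m^2: {predict(model, 90):.0f}")
```

The fitted line says that area predicts price in this data; it says nothing about whether enlarging a flat would cause its price to change by the slope.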

Causal Analysis tries to establish whether changing one variable actually produces a change in another, rather than the two merely being associated.

Goal: to find out what happens to one variable when you make another variable change
·         Usually randomized studies are required to identify causation
·         There are approaches to inferring causation in non-randomized studies, but they are complicated and sensitive to assumptions
·         Causal relationships are usually identified as average effects, but may not apply to every individual
·         Causal models are usually the "gold standard" for data analysis
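
The idea that a randomized study recovers an average effect, while individual effects differ, can be sketched with a small simulation. Everything here is hypothetical: the function name, the outcome scale and the assumed effect distribution (mean 2, standard deviation 1) are chosen only for illustration.

```python
import random
import statistics

def simulate_randomized_trial(n_units=10_000, seed=1):
    """Simulate a two-group randomized study with varying individual effects."""
    rng = random.Random(seed)
    treated, control, effects = [], [], []
    for _ in range(n_units):
        baseline = rng.gauss(50, 10)  # outcome the unit would have untreated
        effect = rng.gauss(2, 1)      # individual effect, differs per unit
        effects.append(effect)
        if rng.random() < 0.5:        # a coin flip decides the group
            treated.append(baseline + effect)
        else:
            control.append(baseline)
    estimated = statistics.mean(treated) - statistics.mean(control)
    return estimated, statistics.mean(effects), min(effects)

estimated, true_average, smallest = simulate_randomized_trial()
print(f"estimated average effect: {estimated:.2f}")
print(f"true average effect:      {true_average:.2f}")
print(f"smallest individual effect: {smallest:.2f}")
```

The difference in group means approximates the average effect because randomization balances the baselines across groups, yet the smallest individual effect can be far from that average.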

Mechanistic analysis
Goal: Understand the exact changes in variables that lead to changes in other variables for individual objects.
·         Incredibly hard to infer, except in simple situations
·         Usually modelled by a deterministic set of equations (physical/engineering science)
·         Generally the random component of the data is measurement error
·         If the equations are known but the parameters are not, they may be inferred with data analysis
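
As a sketch of inferring a parameter when the equation is known, take the free-fall law d = 0.5 * g * t^2 and estimate g from noisy measurements by least squares. The measurements below are invented, and the choice of this particular law is only an example of a deterministic model with measurement error.

```python
def estimate_g(times, distances):
    """Least-squares estimate of g in d = 0.5 * g * t**2.

    With u = 0.5 * t**2 the model is d = g * u, a line through the origin,
    so the estimate is sum(u * d) / sum(u * u).
    """
    u = [0.5 * t ** 2 for t in times]
    return sum(ui * d for ui, d in zip(u, distances)) / sum(ui ** 2 for ui in u)

# Made-up measurements: fall time (s) and distance (m), with small errors.
times = [0.5, 1.0, 1.5, 2.0]
distances = [1.2, 4.9, 11.1, 19.6]

g_hat = estimate_g(times, distances)
print(f"estimated g = {g_hat:.2f} m/s^2")
```

Here the only random component is the measurement error in the distances; the form of the equation itself is taken as given.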

Definition of data
Data are values of qualitative or quantitative variables, belonging to a set of items
·         Items: sometimes called the population; the set of objects you are interested in
·         Variables: a measurement or characteristic of an item
·         Qualitative: country of origin, sex, treatment
·         Quantitative: height, weight, blood pressure
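
One possible way to picture items and variables in code is a list of records, one per item, with one key per variable. The records and the `split_variables` helper below are made up; the rule "numbers are quantitative, everything else is qualitative" is a simplifying assumption, since qualitative variables are sometimes coded as numbers.

```python
# Made-up items (patients); each key is a variable measured on the item.
items = [
    {"country": "Italy", "sex": "F", "treatment": "A",
     "height_cm": 165.0, "weight_kg": 60.5},
    {"country": "France", "sex": "M", "treatment": "B",
     "height_cm": 180.0, "weight_kg": 82.0},
]

def split_variables(records):
    """Split variable names into qualitative and quantitative by value type."""
    qualitative, quantitative = set(), set()
    for record in records:
        for name, value in record.items():
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                quantitative.add(name)
            else:
                qualitative.add(name)
    return sorted(qualitative), sorted(quantitative)

qualitative, quantitative = split_variables(items)
print("qualitative:", qualitative)
print("quantitative:", quantitative)
```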

What do data look like?
·         TXT files with several lines
·         Information extracted through an API, such as Twitter's
·         Medical records in a simple, unstructured TXT file
·         Video
·         Audio files
·         APIs from Data.gov, Excel files, etc.
·         Excel, CSV

The data is the second most important thing
·         The most important thing in data science is the question
·         The second most important is the data
·         Often the data will limit or enable the questions
·         But having data cannot save you if you do not have a question

Experimental Design
Why you should care - an exciting result!
Why you should care - uh oh!
Why you should care - serious trouble!
Know and care about the analysis plan?
It is necessary to define which kind of statistical method you are going to use

Formulate a question in advance
Statistical scenario
Variability - Scenario 1 2 3
Confounding
Correlation is not causation

Randomization and blocking
If you can (and want to), fix a variable
Example: the website always says Obama 2014 on it
If you do not fix a variable, stratify it
Example: if you are testing sign-up phrases and have two website colours, use both phrases equally on both
If you cannot fix a variable, randomize it 
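
The stratify-then-randomize idea can be sketched as follows: within each website colour (the stratum) the two sign-up phrases are used equally often, and which visitor gets which phrase is decided at random. The function name, the visitor IDs, the colours and the phrases are all hypothetical.

```python
import random
from collections import Counter

def stratified_assignment(units, stratum_of, treatments, seed=0):
    """Randomly assign treatments, balanced within every stratum."""
    rng = random.Random(seed)
    by_stratum = {}
    for unit in units:
        by_stratum.setdefault(stratum_of[unit], []).append(unit)
    assignment = {}
    for members in by_stratum.values():
        shuffled = members[:]
        rng.shuffle(shuffled)  # random order within the stratum
        for i, unit in enumerate(shuffled):
            # Cycle through the treatments so each is used equally often.
            assignment[unit] = treatments[i % len(treatments)]
    return assignment

# Made-up setup: 12 visitors, 6 on each website colour, two phrases.
visitors = list(range(12))
colour = {v: ("blue" if v < 6 else "red") for v in visitors}
phrases = ["Sign up now", "Join us"]

assignment = stratified_assignment(visitors, colour, phrases)
counts = Counter((colour[v], assignment[v]) for v in visitors)
for key, count in sorted(counts.items()):
    print(key, count)
```

Because every colour sees each phrase the same number of times, a difference between phrases cannot be explained by colour, and the shuffle guards against variables that were not stratified.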

Why does randomization help?
Prediction
Prediction vs inference
Prediction key quantities
Beware data dredging
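
Data dredging can be illustrated with a simulation: test enough unrelated variables against an outcome and some will look "significant" by chance. In the sketch below all data are pure noise; the threshold 0.444 is roughly the 5% two-sided cutoff for a correlation with 20 observations, and the function names and settings are assumptions made for this example.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def count_spurious(n_variables=200, n_samples=20, threshold=0.444, seed=3):
    """Count noise variables whose correlation with a noise outcome looks large."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
    hits = 0
    for _ in range(n_variables):
        noise = [rng.gauss(0, 1) for _ in range(n_samples)]  # unrelated to outcome
        if abs(pearson(noise, outcome)) > threshold:
            hits += 1
    return hits

hits = count_spurious()
print(f"{hits} of 200 unrelated variables passed the threshold")
```

With a 5% cutoff, around 10 of the 200 noise variables would be expected to pass on average, which is why the question and the analysis plan should be fixed before looking at the data.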

Summary
Good experiments
Have replication
Measure variability
Generalize to the problem you care about
Are transparent

Prediction is not inference:
Both can be important

Beware data dredging