Il gioco della felicità

Descriptive Analysis is the discipline of quantitatively describing the main features of a collection of information; it aims to summarize a sample. This generally means that descriptive statistics, unlike inferential statistics, are not developed based on probability theory. Even when a data analysis draws its main conclusions using inferential statistics, descriptive statistics are generally also presented.

Goal: describe a set of data (e.g. census data)

The first kind of data analysis performed

Usually applied to census data

Description and interpretation are different steps

Descriptions usually cannot be generalized without statistical modelling
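As a minimal sketch of what a descriptive summary looks like in practice, the snippet below computes a few standard summary statistics with Python's standard library. The `ages` sample is hypothetical and only stands in for a real data set such as census records.

```python
import statistics

# Hypothetical sample: ages (in years) of eight survey respondents.
ages = [23, 35, 41, 29, 35, 52, 47, 35]

# Each entry describes the sample itself; nothing here generalizes
# beyond these eight values.
summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "mode": statistics.mode(ages),
    "stdev": statistics.stdev(ages),  # sample standard deviation
    "min": min(ages),
    "max": max(ages),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Interpreting these numbers (for example, deciding whether the spread is large) is a separate step from computing them.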

Exploratory Data Analysis is an approach to analysing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modelling or hypothesis testing task.

Goal: find relationships you didn't know about

Good for discovering new connections

Also useful for defining future analyses

Exploratory analyses are usually not the final say

Exploratory analyses alone should not be used for generalizing or predicting

Correlation does not imply causation
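A typical exploratory step is to check how strongly two variables move together. The sketch below computes the Pearson correlation coefficient by hand; the `hours` and `scores` values are hypothetical, and the helper name `pearson_r` is only illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: hours of study and exam scores for six students.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 70, 72, 80]

r = pearson_r(hours, scores)
print(f"r = {r:.3f}")
```

A value of r close to 1 suggests a relationship worth investigating further. On its own it says nothing about whether studying causes the higher scores.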

Inferential Analysis is the process of drawing conclusions from data that are subject to random variation, for example, observational errors or sampling variation. Inferential statistics are used to test hypotheses and make estimations using sample data. Inferential statistics infer predictions about a larger population that the sample represents. The outcome of statistical inference may be an answer to the question "what should be done next?".

Goal: use a relatively small sample of data to say something about a bigger population

Inference is commonly the goal of statistical models
Inference involves estimating both the quantity you care about and your uncertainty about your estimate
Inference depends heavily on both the population and the sampling scheme
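The idea of estimating both a quantity and the uncertainty about it can be sketched as a sample mean with an approximate 95% confidence interval. This is a sketch under stated assumptions: the `heights_cm` sample is hypothetical, it is treated as a simple random sample, and the normal-approximation multiplier 1.96 is used (for a sample this small a t-quantile would give a somewhat wider interval).

```python
import math
import statistics

def mean_with_interval(sample, z=1.96):
    """Return the sample mean and an approximate 95% confidence interval.

    Assumes a simple random sample and a normal approximation (z = 1.96).
    """
    n = len(sample)
    estimate = statistics.mean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(n)
    margin = z * std_error
    return estimate, (estimate - margin, estimate + margin)

# Hypothetical sample of heights (cm) drawn from a larger population.
heights_cm = [162.0, 175.5, 168.2, 181.0, 159.4, 172.3, 166.8, 177.1, 170.0, 164.7]

estimate, (low, high) = mean_with_interval(heights_cm)
print(f"estimated mean: {estimate:.1f} cm, 95% interval: ({low:.1f}, {high:.1f})")
```

If the sampling scheme were biased (say, only basketball players were measured), the interval would still be computed but would no longer describe the population of interest.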

Predictive Analysis is an area of data mining that deals with extracting information from data and using it to predict trends and behaviour patterns. Often the unknown event of interest is in the future, but predictive analytics can be applied to any type of unknown whether it be in the past, present or future.

Goal: to use data on some object to predict values for another object

If X predicts Y it does not mean that X causes Y
Accurate predictions depend heavily on measuring the right variables
Although there are better and worse prediction models, more data and a simple model often work really well
Prediction is very hard, especially about the future.
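As an illustration of "a simple model", the sketch below fits a straight line by ordinary least squares and uses it to predict a new value. The `spend` and `sales` figures are hypothetical, and `fit_line` and `predict` are illustrative helper names.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Predict y for a new x using a (slope, intercept) pair."""
    slope, intercept = model
    return intercept + slope * x

# Hypothetical training data: advertising spend and resulting sales.
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]

model = fit_line(spend, sales)
print(f"predicted sales at spend 6.0: {predict(model, 6.0):.2f}")
```

The fitted line may predict well even if spend does not cause sales; prediction only needs the association to hold for new data.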

Causal Analysis seeks to establish cause and effect: it asks whether, and by how much, deliberately changing one variable changes another.

Goal: to find out what happens to one variable when you make another variable change

· Usually randomized studies are required to identify causation

· There are approaches to inferring causation in non-randomized studies, but they are complicated and sensitive to assumptions

· Causal relationships are usually identified as average effects, but may not apply to every individual

· Causal models are usually the "gold standard" for data analysis
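The role of randomization can be sketched with a small simulation. Everything here is an assumption made for illustration: the outcomes are synthetic, the true effect is fixed at 2.0 for every unit, and `run_randomized_trial` is a hypothetical helper. Because treatment is assigned at random, the difference in group means estimates the average effect.

```python
import random
import statistics

def run_randomized_trial(n_units=2000, true_effect=2.0, seed=0):
    """Simulate a randomized study and return the estimated average effect."""
    rng = random.Random(seed)

    # Randomly choose exactly half of the units to receive the treatment.
    units = list(range(n_units))
    rng.shuffle(units)
    treated = set(units[: n_units // 2])

    treated_outcomes, control_outcomes = [], []
    for unit in range(n_units):
        baseline = rng.gauss(10.0, 1.0)  # synthetic outcome without treatment
        if unit in treated:
            treated_outcomes.append(baseline + true_effect)
        else:
            control_outcomes.append(baseline)

    # Difference in means: an estimate of the average causal effect.
    return statistics.mean(treated_outcomes) - statistics.mean(control_outcomes)

estimated_effect = run_randomized_trial()
print(f"estimated average effect: {estimated_effect:.2f}")
```

The estimate is an average over all units; in real data the effect for any single individual may differ from it.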

Mechanistic analysis

Goal: Understand the exact changes in variables that lead to changes in other variables for individual objects.

· Incredibly hard to infer, except in simple situations

· Usually modelled by a deterministic set of equations (physical/engineering science)

· Generally the random component of the data is measurement error

· If the equations are known but the parameters are not, they may be inferred with data analysis
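A sketch of the mechanistic case: the governing equation is known, the only randomness is measurement error, and an unknown parameter is inferred from data. The example assumes free fall, d = 0.5 * g * t**2, with synthetic measurements; the noise level, the seed, and the helper name `estimate_g` are all illustrative choices.

```python
import random

def estimate_g(times, distances):
    """Least-squares estimate of g in d = 0.5 * g * t**2.

    With x = 0.5 * t**2 the model is a line through the origin, d = g * x,
    so the estimate is sum(x * d) / sum(x * x).
    """
    xs = [0.5 * t ** 2 for t in times]
    return sum(x * d for x, d in zip(xs, distances)) / sum(x * x for x in xs)

G_TRUE = 9.81  # value used only to generate the synthetic measurements
rng = random.Random(42)

times = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]  # seconds
# Deterministic equation plus a small random measurement error (metres).
distances = [0.5 * G_TRUE * t ** 2 + rng.gauss(0.0, 0.05) for t in times]

g_hat = estimate_g(times, distances)
print(f"estimated g: {g_hat:.3f} m/s^2")
```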

Definition of data

Data are values of qualitative or quantitative variables, belonging to a set of items

· Items: sometimes called the population; the set of objects you are interested in

· Variables: a measurement or characteristic of an item

· Qualitative: country of origin, sex, treatment

· Quantitative: height, weight, blood pressure
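One way to picture this definition is as a list of items, each holding a value for every variable. The records and field names below are hypothetical, and `classify_variables` is an illustrative helper that treats numeric variables as quantitative and everything else as qualitative.

```python
# Each dict is one item; each key is a variable measured on that item.
items = [
    {"country": "Italy", "sex": "F", "treatment": "placebo",
     "height_cm": 165.0, "weight_kg": 58.5, "blood_pressure_mmhg": 118},
    {"country": "Spain", "sex": "M", "treatment": "drug",
     "height_cm": 178.0, "weight_kg": 81.2, "blood_pressure_mmhg": 126},
    {"country": "Italy", "sex": "M", "treatment": "drug",
     "height_cm": 171.5, "weight_kg": 74.0, "blood_pressure_mmhg": 121},
]

def classify_variables(items):
    """Split variable names into qualitative and quantitative lists."""
    qualitative, quantitative = [], []
    for name in items[0]:
        values = [item[name] for item in items]
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        (quantitative if is_numeric else qualitative).append(name)
    return qualitative, quantitative

qualitative, quantitative = classify_variables(items)
print("qualitative:", qualitative)
print("quantitative:", quantitative)
```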

What do data look like?

· TXT files with several lines

· Responses from an API, such as Twitter's, queried to extract information

· Medical records in a simple, unstructured TXT file

· Video

· Audio file

· APIs from Data.gov, Excel files, etc.

· Excel, CSV

The data is the second most important thing

· The most important thing in data science is the question

· The second most important is the data

· Often the data will limit or enable the questions

· But having data cannot save you if you do not have a question

Experimental Design

Why you should care - an exciting result!

Why you should care - uh oh!

Why you should care - serious trouble!

Know and care about the analysis plan?

You need to define which kind of statistical method you are going to use

Formulate a question in advance

Statistical scenario

Variability - Scenario 1 2 3

Confounding

Correlation is not causation

Randomization and blocking

If you can (and want to), fix a variable

Website always says Obama 2014 on it

If you do not fix a variable, stratify it

If you are testing sign up phrases and have two website colours, use both phrases equally on both

If you cannot fix a variable, randomize it
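The stratify-then-randomize idea can be sketched for the sign-up example: within each website colour, visitors are shuffled and the phrases are dealt out in turn, so every phrase is used equally on every colour. The visitor list, the phrase texts, and the helper name `stratified_assignment` are hypothetical.

```python
import random
from collections import Counter

def stratified_assignment(visitors, phrases, seed=0):
    """Assign a phrase to each visitor, balanced within each colour.

    visitors: list of (visitor_id, colour) pairs.
    Returns a dict mapping visitor_id to the assigned phrase.
    """
    rng = random.Random(seed)

    # Group visitors by the variable being stratified (website colour).
    by_colour = {}
    for visitor_id, colour in visitors:
        by_colour.setdefault(colour, []).append(visitor_id)

    assignment = {}
    for ids in by_colour.values():
        rng.shuffle(ids)  # randomize who gets which phrase within the stratum
        for position, visitor_id in enumerate(ids):
            assignment[visitor_id] = phrases[position % len(phrases)]
    return assignment

# Hypothetical visitors: 20 see the blue site, 20 see the red site.
visitors = [(i, "blue" if i % 2 == 0 else "red") for i in range(40)]
phrases = ["Join now", "Sign up today"]

assignment = stratified_assignment(visitors, phrases)
counts = Counter((colour, assignment[vid]) for vid, colour in visitors)
for key, count in sorted(counts.items()):
    print(key, count)
```

Shuffling inside each stratum keeps any unmeasured visitor characteristic from lining up systematically with one phrase.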

Why does randomization help?

Prediction

Prediction vs inference

Prediction key quantities

Beware data dredging
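A small simulation shows why dredging is dangerous. All data below are pure noise generated for illustration, so no predictor is truly related to the outcome; the sizes, the seed, and the `pearson_r` helper are assumptions of this sketch. Searching many predictors still turns up one that looks clearly correlated.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(1)
n_obs, n_predictors = 20, 200

# The outcome and every predictor are independent random noise.
outcome = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]

best = 0.0
for _ in range(n_predictors):
    predictor = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    best = max(best, abs(pearson_r(predictor, outcome)))

print(f"strongest |r| among {n_predictors} unrelated predictors: {best:.2f}")
```

Reporting only the strongest of many tried relationships, without accounting for the search, overstates the evidence.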

Summary

Good experiments

Have replication

Measure variability

Generalize to the problem you care about

Are transparent

Prediction is not inference:

Both can be important

Beware data dredging

Monday, 18 August 2014

Types of data science questions