Descriptive Analysis is the discipline of quantitatively describing
the main features of a collection of information; it aims to
summarize a sample. This generally means that descriptive statistics,
unlike inferential statistics, are not developed based on probability
theory. Even when a data analysis draws its main conclusions using
inferential statistics, descriptive statistics are generally also presented.
Goal: describe a set of data (e.g. census data)
- The first kind of data analysis performed
- Usually applied to census data
- Description and interpretation are different steps
- Descriptions usually cannot be generalized without statistical modelling
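As a minimal sketch of what "describing a set of data" can look like in practice, the snippet below summarizes a small list of numbers using Python's standard library. The household sizes are made-up values standing in for census-style data, and `describe` is a hypothetical helper name.

```python
import statistics

# Hypothetical household sizes, standing in for a census-style data set.
household_sizes = [1, 2, 2, 3, 4, 4, 4, 5, 6]

def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        # Population standard deviation: we describe the data we have,
        # without trying to generalize to anything larger.
        "stdev": statistics.pstdev(values),
    }

summary = describe(household_sizes)
print(summary)
```

Nothing here says anything about households outside the list; that step would need a statistical model.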
Exploratory Data Analysis is an approach to analysing data
sets to summarize their main characteristics, often with visual methods.
A statistical model can be used or not, but primarily EDA is for
seeing what the data can tell us beyond the formal modelling or hypothesis
testing task.
Goal: find relationships you didn't know about
- Good for discovering new connections
- Also useful for defining future analyses
- Exploratory analyses are usually not the final say
- Exploratory analyses alone should not be used for generalizing or predicting
- Correlation does not imply causation
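One common exploratory step is to check how strongly two variables move together. The sketch below computes a Pearson correlation by hand; the variable names and the six data points are invented for illustration, and `pearson_r` is a hypothetical helper rather than a library function.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical measurements, made up for this example.
hours_studied = [1, 2, 3, 4, 5, 6]
exam_score = [52, 55, 61, 64, 70, 73]

r = pearson_r(hours_studied, exam_score)
print(f"correlation: {r:.3f}")
```

A high value only flags a relationship worth a closer look; it does not show that one variable causes the other.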
Inferential Analysis is the process of drawing conclusions from data
that are subject to random variation, for example, observational errors or
sampling variation. Inferential statistics are used to test hypotheses and
make estimations using sample data. Inferential statistics infer predictions
about a larger population that the sample represents. The outcome of
statistical inference may be an answer to the question "what should be
done next?".
Goal: use a relatively small sample of data to say something about a bigger population
- Inference is commonly the goal of statistical models
- Inference involves estimating both the quantity you care about and your uncertainty about your estimate
- Inference depends heavily on both the population and the sampling scheme
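To make "the quantity you care about and your uncertainty" concrete, here is a small sketch that estimates a population mean from a simple random sample and attaches an approximate 95% confidence interval. The population is simulated, the sampling scheme is assumed to be simple random sampling, and `mean_with_ci` is a hypothetical helper that uses the normal approximation (z = 1.96).

```python
import math
import random
import statistics

random.seed(0)

# Simulated population of heights in cm (an assumption for illustration).
population = [random.gauss(170, 10) for _ in range(10_000)]

# Sampling scheme: simple random sample without replacement.
sample = random.sample(population, 200)

def mean_with_ci(values, z=1.96):
    """Return the sample mean and an approximate 95% confidence interval."""
    n = len(values)
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return mean, (mean - z * standard_error, mean + z * standard_error)

estimate, (low, high) = mean_with_ci(sample)
print(f"estimate: {estimate:.1f} cm, 95% CI: ({low:.1f}, {high:.1f})")
```

If the sample were drawn differently (say, only from one city), the same arithmetic would run but the interval would no longer describe the whole population.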
Predictive Analysis is an area of data mining that deals with extracting
information from data and using it to predict trends and behaviour
patterns. Often the unknown event of interest is in the future, but predictive
analytics can be applied to any type of unknown whether it be in the past,
present or future.
Goal: use data on some objects to predict values for another object
- If X predicts Y, it does not mean that X causes Y
- Accurate predictions depend heavily on measuring the right variables
- Although there are better and worse prediction models, more data and a simple model often work really well
- Prediction is very hard, especially about the future
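The "simple model" point can be sketched with a one-variable least-squares line: fit on part of the data, then check the predictions on data the model has not seen. The data are simulated from an assumed relationship (y = 3 + 2x plus noise), and `fit_line` and `predict` are hypothetical helper names.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(model, x):
    """Predict y for a new x using a fitted (intercept, slope) pair."""
    intercept, slope = model
    return intercept + slope * x

random.seed(1)

# Simulated data: y = 3 + 2x + noise (an assumption for illustration).
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [3 + 2 * x + random.gauss(0, 1) for x in xs]

# Hold out the last 50 points to measure prediction error.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

model = fit_line(train_x, train_y)
errors = [predict(model, x) - y for x, y in zip(test_x, test_y)]
rmse = (sum(e * e for e in errors) / len(errors)) ** 0.5
print(f"intercept={model[0]:.2f}, slope={model[1]:.2f}, test RMSE={rmse:.2f}")
```

The held-out error is the quantity that matters for prediction; a good fit on the training points alone says little.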
Causal Analysis is concerned with cause and effect: determining whether, and by
how much, changing one variable produces a change in another.
Goal: find out what happens to one variable when you make another variable change
- Usually randomized studies are required to identify causation
- There are approaches to inferring causation in non-randomized studies, but they are complicated and sensitive to assumptions
- Causal relationships are usually identified as average effects, but may not apply to every individual
- Causal models are usually the "gold standard" for data analysis
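A small simulation can illustrate why randomized studies identify an average effect that need not hold for every individual. Everything below is simulated under stated assumptions: each person has a baseline outcome and a personal treatment effect (average +2, but varying), and treatment is assigned by a fair coin flip.

```python
import random
import statistics

random.seed(2)

n = 4000

# Each simulated person has a baseline outcome and an individual treatment
# effect. The average effect is +2, but it varies from person to person.
people = [(random.gauss(50, 5), random.gauss(2, 3)) for _ in range(n)]

# Randomize: a coin flip decides who is treated, independent of the person.
treated, control = [], []
for baseline, effect in people:
    if random.random() < 0.5:
        treated.append(baseline + effect)
    else:
        control.append(baseline)

# The difference in group means estimates the average treatment effect.
estimated_ate = statistics.mean(treated) - statistics.mean(control)

# Share of people whose individual effect is actually negative.
share_harmed = sum(1 for _, effect in people if effect < 0) / n

print(f"estimated average effect: {estimated_ate:.2f}")
print(f"share with a negative individual effect: {share_harmed:.0%}")
```

The estimate lands near the average effect even though a sizeable share of individuals would be worse off with the treatment.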
Mechanistic analysis
Goal: understand the exact changes in variables that lead to changes in other variables for individual objects
- Incredibly hard to infer, except in simple situations
- Usually modelled by a deterministic set of equations (physical/engineering science)
- Generally the random component of the data is measurement error
- If the equations are known but the parameters are not, they may be inferred with data analysis
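As an illustration of inferring a parameter when the equation is known, the sketch below uses the free-fall relation d = (g/2) t^2 (chosen here as an example; it is not from the original notes) and estimates g by least squares from simulated distance measurements whose only randomness is measurement error. `estimate_g` is a hypothetical helper.

```python
import random

random.seed(3)

TRUE_G = 9.81  # used only to simulate the measurements

# Drop times from 0.5 s to 5.0 s and distances measured with random error.
times = [0.5 * k for k in range(1, 11)]
distances = [0.5 * TRUE_G * t ** 2 + random.gauss(0, 0.5) for t in times]

def estimate_g(times, distances):
    """Least-squares estimate of g in d = (g / 2) * t**2 (no intercept)."""
    u = [t ** 2 for t in times]  # the regressor is t squared
    half_g = sum(ui * d for ui, d in zip(u, distances)) / sum(ui ** 2 for ui in u)
    return 2 * half_g

g_hat = estimate_g(times, distances)
print(f"estimated g: {g_hat:.2f} m/s^2")
```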
Definition of data
Data are values of qualitative or quantitative variables, belonging to a set of items
- Items: sometimes called the population; the set of objects you are interested in
- Variables: a measurement or characteristic of an item
- Qualitative: country of origin, sex, treatment
- Quantitative: height, weight, blood pressure
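One way to picture this definition is a list of records, one per item, with one key per variable. The records below are invented, and `variable_kind` is a hypothetical helper that uses a rough rule of thumb (numeric values mean quantitative); note that numeric codes such as postcodes would be misclassified by this rule.

```python
# Each dict is one item; each key is a variable. All values are made up.
items = [
    {"country": "Italy", "sex": "F", "treatment": "placebo",
     "height_cm": 165.0, "weight_kg": 60.5},
    {"country": "Brazil", "sex": "M", "treatment": "drug",
     "height_cm": 180.0, "weight_kg": 82.0},
    {"country": "Japan", "sex": "F", "treatment": "drug",
     "height_cm": 158.5, "weight_kg": 51.2},
]

def variable_kind(values):
    """Rough rule: numeric values are quantitative, anything else qualitative."""
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "quantitative" if is_numeric else "qualitative"

# Classify every variable by looking at its values across all items.
kinds = {
    name: variable_kind([item[name] for item in items]) for name in items[0]
}
print(kinds)
```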
What do data look like?
- A TXT file with several lines
- An API, such as Twitter's, used to extract information
- Medical records in a simple, unstructured TXT file
- Video
- Audio files
- APIs from Data.gov, Excel files, etc.
- Excel, CSV
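Of the formats listed, CSV is usually the easiest to load. A minimal sketch with Python's standard `csv` module follows; the file contents are invented and read from an in-memory string so the example is self-contained.

```python
import csv
import io

# Hypothetical CSV content; in practice this would come from a file.
raw_csv = """name,height_cm,weight_kg
Ana,165,60.5
Ben,180,82
"""

# DictReader uses the header row as keys and yields one dict per data row.
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Every value arrives as a string, so quantitative columns need converting.
for row in rows:
    row["height_cm"] = float(row["height_cm"])
    row["weight_kg"] = float(row["weight_kg"])

print(rows)
```

Unstructured sources such as free-text medical records, audio, or video need far more work before they reach this tabular shape.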
The data is the second most important thing
- The most important thing in data science is the question
- The second most important is the data
- Often the data will limit or enable the questions
- But having data cannot save you if you do not have a question
Experimental Design
Why you should care - an exciting result!
Why you should care - uh oh!
Why you should care - serious trouble!
Know and care about the analysis plan
It is necessary to define which kind of statistical method you are going to use
Formulate a question in advance
Statistical scenario
Variability - Scenarios 1, 2, 3
Confounding
Correlation is not causation
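Confounding can be sketched with a simulation in which a third variable drives two others. The setup below is an assumption made for illustration: a child's age influences both shoe size and literacy score, and neither of those affects the other. `pearson_r` is a hypothetical helper.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

random.seed(4)

# Age (the confounder) drives both variables; they do not affect each other.
ages = [random.uniform(5, 15) for _ in range(20_000)]
shoe_size = [20 + 1.5 * a + random.gauss(0, 1) for a in ages]
literacy = [10 * a + random.gauss(0, 10) for a in ages]

# Across all children the two variables look strongly related.
r_all = pearson_r(shoe_size, literacy)

# Within a narrow age band, age is nearly fixed and the relationship vanishes.
band = [(s, l) for a, s, l in zip(ages, shoe_size, literacy) if 9.9 <= a < 10.1]
r_band = pearson_r([s for s, _ in band], [l for _, l in band])

print(f"all ages: r = {r_all:.2f}; ages 9.9 to 10.1 only: r = {r_band:.2f}")
```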
Randomization and blocking
- If you can and want to, fix a variable
  - The website always says Obama 2014 on it
- If you do not fix a variable, stratify it
  - If you are testing sign-up phrases and have two website colours, use both phrases equally on both
- If you cannot fix a variable, randomize it
Why does randomization help?
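The sign-up phrase example can be sketched as a stratified random assignment: within each website colour, visitors are shuffled and the phrases are dealt out in turn, so each phrase appears equally often on each colour. The visitor IDs, phrase wording, and the `assign_phrases` helper are all hypothetical.

```python
import random
from collections import Counter

def assign_phrases(visitors, phrases, rng):
    """Within each colour stratum, shuffle visitors and alternate phrases."""
    by_colour = {}
    for visitor_id, colour in visitors:
        by_colour.setdefault(colour, []).append(visitor_id)

    assignment = {}
    for colour, ids in by_colour.items():
        rng.shuffle(ids)  # randomize who gets which phrase
        for i, visitor_id in enumerate(ids):
            # Dealing phrases in turn keeps them balanced within the stratum.
            assignment[visitor_id] = (colour, phrases[i % len(phrases)])
    return assignment

rng = random.Random(5)

# 40 hypothetical visitors, half seeing a blue site and half a red one.
visitors = [(i, "blue" if i % 2 == 0 else "red") for i in range(40)]

assignment = assign_phrases(visitors, ["Sign up", "Join now"], rng)
counts = Counter(assignment.values())
print(counts)
```

Because the shuffle decides which visitor gets which phrase, unmeasured visitor characteristics cannot line up systematically with either phrase.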
Prediction
Prediction vs inference
Prediction key quantities
Beware data dredging
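Data dredging can be demonstrated by testing many unrelated variables against one outcome and counting how many look "significant" purely by chance. In this sketch everything is random noise by construction; the cutoff of 2.01 is an approximate two-sided 5% critical value for 48 degrees of freedom, and the helper names are hypothetical.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def t_statistic(r, n):
    """t statistic for testing whether a correlation differs from zero."""
    return r * math.sqrt((n - 2) / (1 - r * r))

random.seed(6)

n_subjects = 50
n_variables = 1000
T_CRITICAL = 2.01  # approximate two-sided 5% cutoff for 48 degrees of freedom

# The outcome and every candidate variable are independent noise.
outcome = [random.gauss(0, 1) for _ in range(n_subjects)]

false_positives = 0
for _ in range(n_variables):
    noise = [random.gauss(0, 1) for _ in range(n_subjects)]
    r = pearson_r(noise, outcome)
    if abs(t_statistic(r, n_subjects)) > T_CRITICAL:
        false_positives += 1

print(f"{false_positives} of {n_variables} noise variables look 'significant'")
```

Roughly 5% of the noise variables are expected to pass the test, which is why a question formulated in advance matters more than a long list of comparisons.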
Summary
Good experiments
- Have replication
- Measure variability
- Generalize to the problem you care about
- Are transparent
Prediction is not inference
- Both can be important
Beware data dredging