Lecture 1¶

Classification - Data to classes

Regression - Predicting a numeric value

Clustering

Different types of problems¶

Classification Problem - MNIST Dataset

Regression - Predicting stock value

Clustering

Automatically identify groups in the data

Data integration¶

Data are created independently
A higher-level abstraction

Statistical analysis¶

Collecting data¶

Collecting, exploring and presenting large amounts of data to discover underlying patterns and trends

Data come in two types:
- Discrete
- Continuous

Common ways to present data:
- Bar chart
- Pie chart
- Stem-and-leaf plot
- Scatterplot (uses Cartesian coordinates to display the values of two variables for a set of data). When reading a scatterplot, look at:
  - Form
  - Direction
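As a small illustration of one of the displays listed above, here is a minimal sketch of a stem-and-leaf plot in plain Python. The function name `stem_and_leaf` and the `scores` data are made up for this example, and the sketch assumes non-negative integers with the last digit used as the leaf.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf display.

    Assumption for this sketch: values are non-negative integers,
    the last digit is the leaf and the remaining digits are the stem.
    """
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    return [
        f"{stem} | {' '.join(str(d) for d in leaves)}"
        for stem, leaves in sorted(groups.items())
    ]


scores = [12, 15, 21, 23, 23, 30, 38]  # hypothetical data
for line in stem_and_leaf(scores):
    print(line)
```

Each row keeps the individual values visible while still showing the overall shape of the distribution, which is what distinguishes this display from a bar chart.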

Numerical descriptive measures of data

Central tendency:
- Mean
- Median
- Mode

Extremes of the data:
- Min
- Max
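The measures listed above can be computed with Python's standard `statistics` module. This is only a sketch on a made-up data set (`data` is not from the lecture).

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # hypothetical sample

summary = {
    "mean": statistics.mean(data),      # arithmetic average
    "median": statistics.median(data),  # middle value of the sorted data
    "mode": statistics.mode(data),      # most frequent value
    "min": min(data),
    "max": max(data),
}
print(summary)
```

Note that the median of an even number of values is the average of the two middle values, so it does not have to be one of the observed data points.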

A sampling method is a procedure for selecting sample elements from a population.
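One common sampling method is simple random sampling, where every element of the population has the same chance of being selected. The notes do not name a specific method, so the sketch below is just one possible example. The function name `simple_random_sample` and the population of 100 numbered elements are assumptions.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Pick k distinct elements from the population, each equally likely.

    The seed parameter is only there to make the example reproducible.
    """
    rng = random.Random(seed)
    return rng.sample(population, k)


population = list(range(1, 101))  # hypothetical population: elements 1..100
sample = simple_random_sample(population, 10, seed=0)
print(sample)
```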

Relationship between variables:¶

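Eyeball fit: Pick two points on the plot so that the line passing through them gives a fairly good fit.
Least squares fit: Fit a line $y = a + bx$ that minimizes the error $S = \sum_i (y_i - (a + b x_i))^2$, the sum of squared vertical distances between the data points and the line.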
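A minimal sketch of the least squares fit, assuming the error $S$ is the sum of squared residuals $\sum_i (y_i - (a + b x_i))^2$. The closed-form solution used here is the standard one for a single predictor. The function names and the data are made up for the example.

```python
def least_squares_fit(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimizes the squared error."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = y_mean - b * x_mean
    return a, b


def squared_error(xs, ys, a, b):
    """Sum of squared residuals S for the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # hypothetical data lying exactly on y = 1 + 2x
a, b = least_squares_fit(xs, ys)
print(a, b, squared_error(xs, ys, a, b))
```

Because the example data lie exactly on a line, the fitted line reproduces it and the error is zero. With noisy data the error is positive, but no other line gives a smaller one.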
Correlation coefficient, denoted as $r$, measures the degree to which the movements of two variables are associated. It ranges from -1 to 1.
- r = 1 means a perfect positive relationship
- r = -1 means a perfect negative relationship
- r = 0 means no linear relationship
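The three cases can be checked numerically. The sketch below assumes $r$ is the Pearson correlation coefficient, which is the usual meaning of $r$ in this context. The function name `pearson_r` and the data are made up for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r_pos = pearson_r([1, 2, 3], [2, 4, 6])    # y rises with x
r_neg = pearson_r([1, 2, 3], [6, 4, 2])    # y falls as x rises
r_zero = pearson_r([-1, 0, 1], [1, 0, 1])  # y = x**2, no linear trend
print(r_pos, r_neg, r_zero)
```

The last case is worth noting: $y = x^2$ is a perfect relationship, yet $r$ is zero, because $r$ only measures linear association.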

Forecasting¶

An experiment is an action where the result is uncertain
A sample space is the set of all possible outcomes of an experiment, denoted as $S$.
An event is a subset of $S$.

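Probability is a measure of how likely an event is to occur. When all outcomes in the sample space are equally likely, it is the number of outcomes in the event divided by the total number of possible outcomes.

$p = \frac{\text{number of outcomes in the event}}{\text{number of outcomes in the sample space}}$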
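As a worked example, consider rolling a fair six-sided die, so that all outcomes are equally likely. The sketch below is an illustration only. The function name `probability` and the die example are not from the lecture.

```python
from fractions import Fraction


def probability(event, sample_space):
    """Probability of an event, assuming equally likely outcomes."""
    sample_space = set(sample_space)
    event = set(event) & sample_space  # an event is a subset of the sample space
    return Fraction(len(event), len(sample_space))


die = {1, 2, 3, 4, 5, 6}  # sample space S for one roll
even = {2, 4, 6}          # event: the roll is even
print(probability(even, die))  # 1/2
```

Using `Fraction` keeps the result exact, which makes it easy to compare against a hand calculation.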

Parameters¶

A sample can be generated by a probability model; parameters are the characteristics of the model (for example, its mean).
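To make this concrete, the sketch below generates a sample from a normal model. The choice of a normal distribution, the parameter values `mu` and `sigma`, and the function name `generate_sample` are all assumptions for illustration. The notes do not specify a particular model.

```python
import random
import statistics


def generate_sample(mu, sigma, n, seed=0):
    """Draw n values from a normal model with parameters mu and sigma."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


sample = generate_sample(mu=10.0, sigma=2.0, n=10_000)
# The sample statistics should be close to the model parameters.
print(statistics.mean(sample), statistics.stdev(sample))
```

The parameters belong to the model, while the mean and standard deviation printed here are computed from the sample and only approximate them.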

Variance¶

Variance is another parameter of a probability model.
It is a measure of how spread out the distribution is.
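A short sketch of how variance captures spread, assuming the population form (the average squared deviation from the mean). The two data sets are made up and share the same mean so that only the spread differs.

```python
import statistics


def population_variance(data):
    """Average squared deviation from the mean."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / len(data)


tight = [4, 5, 5, 6]    # hypothetical data clustered around 5
spread = [0, 5, 5, 10]  # same mean, much more spread out

print(population_variance(tight), population_variance(spread))
# The standard library gives the same result for the population variance.
print(statistics.pvariance(tight), statistics.pvariance(spread))
```

When estimating the variance from a sample rather than a whole population, the divisor is usually $n - 1$ instead of $n$, which is what `statistics.variance` computes.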

Statistical analysis¶

Collecting, exploring and presenting large amounts of data to discover underlying patterns and