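How to calculate the principal axes (PCA)

• Calculate the mean of each feature
• Calculate the covariance matrix A, with entries
• cov(x, y) = 1/n SUM( (x - x_mean)*(y - y_mean) )
• Set up the characteristic equation det(A - lambda*I) = 0
• Solve it to get the eigenvalues lambda
• For each lambda, go back to your covariance matrix and write out A*v = lambda*v, with v = (v1, v2, ...)
• First row: a11*v1 + a12*v2 + ... = lambda*v1
• Second row: a21*v1 + a22*v2 + ... = lambda*v2
• etc
• Get the relationship between v1, v2 etc for each of the different lambdas
• Normalise them. Divide each v by sqrt(v1^2 + v2^2 + ...)
• You now have your principal axes; the one with the largest lambda is the first principal axis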
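
The steps above can be sketched in a few lines of NumPy. This is only an illustration: the function name `principal_axes` and the toy data are made up for the example, and `np.linalg.eigh` is used in place of solving the characteristic equation by hand (it returns eigenvectors that are already normalised).

```python
import numpy as np

def principal_axes(X):
    """Return eigenvalues (descending) and matching unit eigenvectors as columns."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)              # subtract the mean of each feature
    cov = centered.T @ centered / X.shape[0]   # covariance matrix with the 1/n convention
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: for symmetric matrices
    order = np.argsort(eigenvalues)[::-1]      # largest eigenvalue first
    return eigenvalues[order], eigenvectors[:, order]

# Toy data (assumed for illustration): points spread mostly along the line y = x
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
eigenvalues, axes = principal_axes(X)
print(eigenvalues)   # variance along each principal axis
print(axes[:, 0])    # first principal axis (sign may be flipped)
```

For this data the covariance matrix is [[1.25, 0.75], [0.75, 1.25]], so the eigenvalues are 2.0 and 0.5 and the first principal axis points along (1, 1)/sqrt(2), up to sign.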
ROC - explanation

Plot of the true positive rate (TPR) against the false positive rate (FPR), with one point for each classification threshold
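
A minimal sketch of how the points of a ROC curve could be computed by hand. The helper `roc_points`, the labels and the scores are all invented for the example; a sample is predicted positive when its score is at or above the threshold.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per threshold, starting from (0, 0)."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical labels (1 = positive) and classifier scores
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold moves the point from (0, 0) towards (1, 1); a better classifier keeps the curve closer to the top-left corner.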

ID3, when to use?

• When all features are categorical and there are no missing values
• Its extension C4.5 accepts real-valued and missing features
• C4.5 also uses a pruning mechanism to reduce tree size

Principal component analysis - when?

When we want to visualize high-dimensional data

When we want to work with fewer dimensions (dimensionality reduction)

Give examples of different types of partitioning clustering

k-means

k-medoids

CLARANS

Give examples of different types of GRID-based clustering

STING

WaveCluster

CLIQUE

What is the curse of dimensionality?

As the number of features/dimensions grows, the amount of data we need in order to generalize accurately grows exponentially
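
One rough way to see the exponential growth: suppose (as an assumption for this example) that each feature is split into 10 bins and we want at least one sample in every cell of the resulting grid. The helper `cells_needed` is hypothetical.

```python
def cells_needed(bins_per_feature, n_features):
    """Number of grid cells when every feature is split into the same number of bins."""
    return bins_per_feature ** n_features

for n_features in (1, 2, 3, 10):
    print(n_features, "features ->", cells_needed(10, n_features), "cells to cover")
```

Going from 3 to 10 features raises the number of cells from a thousand to ten billion, so the same amount of data covers the space far more sparsely.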

What are the reasons for overfitting?

• data contains noise
• not enough data
• model is too complex
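
A small sketch of the "too complex model on too little, noisy data" case. The data is synthetic (a sine curve plus noise) and the choice of polynomial degrees 1 and 9 is an assumption made for the example.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
# Only 10 noisy training points sampled from a sine curve
x_train = np.linspace(0.0, 1.0, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, size=x_train.size)

def train_mse(degree):
    """Mean squared error on the training points of a least-squares polynomial fit."""
    model = Polynomial.fit(x_train, y_train, degree)
    return float(np.mean((model(x_train) - y_train) ** 2))

simple_error = train_mse(1)    # straight line
complex_error = train_mse(9)   # degree 9 can pass through all 10 points
print("degree 1 training MSE:", simple_error)
print("degree 9 training MSE:", complex_error)
```

The degree-9 polynomial reaches a training error of essentially zero because it also fits the noise, which is exactly why a low training error alone does not show that the model generalizes.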
Requirements for a good clustering model

->Scalability: being able to use the model with more data than the sample data

->Being able to use it with different types of attributes (binary, numerical, categorical)

->Interpretability and usability

Discuss ensemble learning

Ensemble learning is a machine learning paradigm where multiple models (often called “weak learners”) are trained to solve the same problem and combined to get better results. The main hypothesis is that when weak models are correctly combined, we can obtain more accurate and/or robust models.
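
A minimal sketch of one way to combine models, majority voting. The three "models" are just hard-coded prediction lists invented for the example, each wrong on a different sample; `majority_vote` is a hypothetical helper.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model prediction lists by taking the most common label per sample."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true  = [1, 0, 1, 1, 0, 1]
model_a = [1, 0, 0, 1, 0, 1]  # wrong on the third sample
model_b = [1, 1, 1, 1, 0, 1]  # wrong on the second sample
model_c = [1, 0, 1, 0, 0, 1]  # wrong on the fourth sample

ensemble = majority_vote([model_a, model_b, model_c])
for name, pred in [("a", model_a), ("b", model_b), ("c", model_c), ("ensemble", ensemble)]:
    print(name, accuracy(y_true, pred))
```

Because the individual mistakes fall on different samples, the vote corrects all of them here. If the models made the same mistakes, voting would not help.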
Pros of ensemble learning

Better accuracy

More consistency

Reduces bias (boosting) and variance (bagging)

What are the reasons for underfitting?

• data is not clean
• model has bias
• small amount of data
• model is too simple
