What is a clustering?

A set of clusters; the output of a cluster analysis.

What are possible requirements of Clustering in Data Mining?

• Scalability: must remain feasible on large datasets
• Different types of attributes: Boolean, real-valued, ...
• Dynamically changing data: e.g. the distribution may drift over time
• Clusters of arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters: we may know nothing about the domain
• Robustness to noise and outliers
• Insensitivity to the order of input records
• High dimensionality
• User-specified constraints
• Interpretability and usability

What types of clusterings are there?

• Exclusive vs. overlapping: each instance belongs to exactly one cluster vs. possibly several
• Categorical vs. probabilistic: each instance either belongs to a cluster or not vs. each instance has a membership probability for each cluster
• Hierarchical vs. flat: the clusters form a hierarchy (like a tree) vs. no hierarchy
• Online vs. batch: the data arrives as a stream and each new instance has to be handled as it is received vs. the algorithm has access to all instances at once
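
The categorical vs. probabilistic distinction can be illustrated with a small sketch. It assumes one-dimensional Gaussian clusters with equal prior weight; the cluster parameters and the function names are made up for illustration and do not come from the cards.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def soft_assignment(x, clusters):
    """Probabilistic clustering: one membership probability per cluster."""
    densities = [gaussian_pdf(x, mean, std) for mean, std in clusters]
    total = sum(densities)
    return [d / total for d in densities]

def hard_assignment(x, clusters):
    """Categorical, exclusive clustering: index of the single most likely cluster."""
    probs = soft_assignment(x, clusters)
    return max(range(len(probs)), key=probs.__getitem__)

# Hypothetical clusters as (mean, standard deviation) pairs.
clusters = [(0.0, 1.0), (4.0, 1.0)]
probs = soft_assignment(2.0, clusters)   # point midway between both means
label = hard_assignment(0.5, clusters)   # point close to the first mean
```

A point midway between the two means gets equal membership probabilities, while the hard assignment always commits to exactly one cluster.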

Why shouldn't you confuse clusters and classes in labeled data?

There may be several clusters for one class.

How to evaluate k?

Cross-validation: compare the likelihood of held-out data for different values of k.

A big advantage of probabilistic over non-probabilistic clustering: the likelihood can be used to compare clusterings.
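
As a minimal sketch of comparing clusterings by likelihood, the code below scores two one-dimensional Gaussian mixtures on held-out points. The data and both models are invented for illustration; in practice the parameters would be fitted on the training folds (for example with EM), which is not shown here.

```python
import math

def mixture_log_likelihood(data, components):
    """Sum of log densities of `data` under a 1-D Gaussian mixture.

    `components` is a list of (weight, mean, std) triples whose weights sum to 1.
    """
    total = 0.0
    for x in data:
        density = 0.0
        for weight, mean, std in components:
            z = (x - mean) / std
            density += weight * math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))
        total += math.log(density)
    return total

# Made-up held-out points that form two groups around 0 and 10.
held_out = [-0.5, 0.0, 0.5, 9.5, 10.0, 10.5]

# Hypothetical models, as if fitted on a separate training fold.
model_k1 = [(1.0, 5.0, 5.0)]
model_k2 = [(0.5, 0.0, 1.0), (0.5, 10.0, 1.0)]

scores = {
    1: mixture_log_likelihood(held_out, model_k1),
    2: mixture_log_likelihood(held_out, model_k2),
}
best_k = max(scores, key=scores.get)
```

Scoring on held-out data rather than on the training data matters because the training likelihood tends to keep growing as k increases.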

How to handle nominal attributes?

Store a discrete probability table for each nominal attribute.

If the attributes are correlated, a joint table over them is needed, and its size grows exponentially with the number of attributes.
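
The growth can be made concrete by counting table entries per cluster. The attribute names and domain sizes below are hypothetical.

```python
from math import prod

# Hypothetical nominal attributes and their numbers of distinct values.
domain_sizes = {"color": 3, "shape": 4, "material": 5}

def independent_table_entries(sizes):
    """Entries needed per cluster when each attribute has its own table."""
    return sum(sizes.values())

def joint_table_entries(sizes):
    """Entries needed per cluster for one joint table over all attributes."""
    return prod(sizes.values())

independent = independent_table_entries(domain_sizes)  # 3 + 4 + 5
joint = joint_table_entries(domain_sizes)              # 3 * 4 * 5
```

With ten binary attributes the independent tables need 20 entries in total, while the joint table needs 2^10 = 1024.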

How are canonization and graph isomorphism related?

If we can solve canonization, we can solve isomorphism: canonize both graphs and test whether the canonical forms are identical.
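
The reduction can be sketched with a brute-force canonical form for small, unlabeled, undirected graphs. Trying every vertex relabeling takes factorial time, so this only illustrates the idea and is not a practical canonization algorithm; the function names are illustrative.

```python
from itertools import permutations

def canonical_form(num_vertices, edges):
    """Smallest sorted edge list over all vertex relabelings (brute force)."""
    best = None
    for perm in permutations(range(num_vertices)):
        relabeled = sorted(
            tuple(sorted((perm[u], perm[v]))) for u, v in edges
        )
        candidate = tuple(relabeled)
        if best is None or candidate < best:
            best = candidate
    return best

def are_isomorphic(n1, edges1, n2, edges2):
    """Two graphs are isomorphic iff their canonical forms coincide."""
    if n1 != n2:
        return False
    return canonical_form(n1, edges1) == canonical_form(n2, edges2)

path_a = [(0, 1), (1, 2)]            # path 0 - 1 - 2
path_b = [(2, 0), (0, 1)]            # same path with different vertex numbering
triangle = [(0, 1), (1, 2), (0, 2)]
```

Here `are_isomorphic(3, path_a, 3, path_b)` holds because both paths reduce to the same canonical edge list, while the triangle has a different one.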

What is Data Mining?

• Knowledge Discovery in Databases (KDD) (Fayyad 96): “KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.”
• Data Mining: the data analysis step within the KDD process

What is Machine Learning?

Improve on task T with respect to performance measure P, based on experience E.

Example (checkers): T = playing checkers, P = games won, E = playing against oneself.

What are descriptive and predictive pattern mining?

Descriptive: e.g. clustering

Predictive: e.g. classification

Theoretical formulation of pattern mining

Language of patterns L

Database D

Interestingness predicate q(p, D) for p ∈ L: q(p, D) = 1 if p is interesting with respect to D, and 0 otherwise
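
One way to instantiate the three ingredients is frequent itemset mining. This is an assumed example, not one given on the card: L is the set of itemsets up to a size limit, D is a list of transactions, and q checks a minimum support. The data and the threshold are made up.

```python
from itertools import combinations

def pattern_language(items, max_size):
    """L: all non-empty itemsets over `items` with at most `max_size` items."""
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(items), size):
            yield frozenset(combo)

def q(pattern, database, min_support=2):
    """Interestingness predicate: 1 if enough transactions contain the pattern."""
    support = sum(1 for transaction in database if pattern <= transaction)
    return 1 if support >= min_support else 0

# D: a made-up transaction database.
D = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
]
items = set().union(*D)

# Enumerate L and keep every pattern that q marks as interesting.
interesting = [p for p in pattern_language(items, 2) if q(p, D) == 1]
```

Enumerating all of L is only feasible for tiny examples; the sketch is meant to show how L, D and q fit together, not to be an efficient miner.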

What is graph mining?

Pattern mining on graphs.

Given a graph database D, find all subgraphs (patterns) that occur with frequency ≥ f.
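
A heavily restricted sketch of this task follows. It considers only single-edge patterns in vertex-labeled, undirected graphs and assumes that frequency means the number of graphs in D that contain the pattern. Full frequent subgraph mining would also have to grow larger patterns and test subgraph isomorphism, which is omitted here; the data and names are made up.

```python
from collections import Counter

def frequent_edge_patterns(graph_db, min_freq):
    """Single-edge patterns that occur in at least `min_freq` graphs.

    Each graph is a pair (labels, edges): `labels` maps vertex id to label,
    `edges` is a list of undirected (u, v) pairs.
    """
    support = Counter()
    for labels, edges in graph_db:
        # Use a set so a pattern counts once per graph, however often it occurs.
        patterns = {tuple(sorted((labels[u], labels[v]))) for u, v in edges}
        support.update(patterns)
    return {p: n for p, n in support.items() if n >= min_freq}

# Made-up database of three small vertex-labeled graphs.
graph_db = [
    ({0: "C", 1: "O", 2: "H"}, [(0, 1), (0, 2)]),
    ({0: "C", 1: "O"}, [(0, 1)]),
    ({0: "C", 1: "H", 2: "N"}, [(0, 1), (0, 2)]),
]
frequent = frequent_edge_patterns(graph_db, min_freq=2)
```

With f = 2, the C-O and C-H edges are frequent, while the C-N edge occurs in only one graph and is dropped.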

