Knowledge Discovery from Data.
Iterative step:
Data characterization: Summarize the main features of a target class.
Data discrimination: Compare the main features of the target class with those of contrasting classes.
Frequent pattern discovery: frequent itemsets
ordered: sequential patterns
structured: structured patterns
Classification & Regression
Clustering: class labels are not known in advance.
Outlier analysis
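$$ \text{support}(X \Rightarrow Y) := P(X \cup Y) $$
$$ \text{confidence}(X \Rightarrow Y) := P(Y \mid X) $$
Here P(X ∪ Y) is the probability that a transaction contains every item of both X and Y, not the probability that it contains X or Y.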
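A minimal sketch of how support and confidence could be estimated from a list of transactions. The function names and the toy transactions are hypothetical and chosen only for illustration; support is counted as the fraction of transactions containing all items of X and Y together.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, x, y):
    """Estimate P(Y | X) as support(X and Y together) / support(X)."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


# Hypothetical toy transactions, made up for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Rule: bread => milk
rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
```

In this toy data two of four transactions contain both bread and milk, and two of the three transactions with bread also contain milk.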
or subjective ones
mode only
Symmetric: both states carry the same weight
Asymmetric: the two states are not equally important
mode and median
mean, mode and median
ratio is meaningless here
contrary to the one above
Values of unequal importance: weighted mean
Sensitive to outliers: trimmed mean
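The two variants of the mean can be sketched as follows. This is an illustrative sketch, not a reference implementation: the helper names are hypothetical, and the trimmed mean here assumes the same proportion is cut from both ends of the sorted data.

```python
def weighted_mean(values, weights):
    """Mean in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, proportion):
    """Mean after dropping floor(proportion * n) values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical data with one extreme value.
data = [2, 3, 3, 4, 100]

plain = sum(data) / len(data)          # pulled upward by the outlier
trimmed = trimmed_mean(data, 0.2)      # drops 2 and 100
weighted = weighted_mean([1, 2, 3], [3, 1, 1])
```

Cutting one value from each end removes the outlier 100, so the trimmed mean stays close to the bulk of the data while the plain mean does not.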
For skewed data.
For a large dataset, the median can be approximated by interpolation over grouped data:
$$ \text{median} = L + \left( \frac{N/2 - \sum_{l} freq_l}{freq_{\text{median}}} \right) width $$
where L is the lower bound of the median interval, N is the number of data values, \sum_l freq_l is the sum of the frequencies of all intervals below the median interval, freq_median is the frequency of the median interval, and width is the width of the median interval.
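A small sketch of the interpolated median for grouped data. It takes N/2, subtracts the cumulative frequency of the intervals below the median interval, divides by the frequency of the median interval, and scales by its width. The bin representation `(lower_bound, width, frequency)` and the example bins are assumptions made for illustration.

```python
def grouped_median(bins):
    """Approximate the median from bins of (lower_bound, width, frequency).

    Bins are assumed to be sorted by lower bound and non-overlapping.
    """
    n = sum(freq for _, _, freq in bins)
    cumulative = 0  # total frequency of the bins below the current one
    for lower, width, freq in bins:
        if freq > 0 and cumulative + freq >= n / 2:
            # This bin contains the median: interpolate inside it.
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq
    raise ValueError("bins must contain at least one observation")


# Hypothetical grouped data: intervals [0, 10), [10, 20), [20, 30).
bins = [(0, 10, 5), (10, 10, 10), (20, 10, 5)]
approx_median = grouped_median(bins)
```

With these symmetric bins N = 20, five values lie below the middle interval, so the estimate lands halfway through [10, 20).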
Multiple modes may exist.
Points chosen so that they split the sorted data into parts of equal size.
The median is a special case of this.
Interquartile range (IQR): Q3 - Q1
Five-number summary: (min, Q1, median, Q3, max), in that order; visualized with a boxplot
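As a sketch, the quartiles, IQR and five-number summary can be computed with Python's standard `statistics` module. Quartile conventions differ between tools; this example assumes the `inclusive` method of `statistics.quantiles`, and the data is made up for illustration.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))


# Hypothetical sample.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

summary = five_number_summary(data)
iqr = summary[3] - summary[1]  # Q3 - Q1
```

A boxplot draws exactly these five values: the box spans Q1 to Q3, the line inside marks the median, and the whiskers reach toward min and max.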
$$ \frac{\sum_i x_i^2}{N} - x_{mean}^2 $$
Square root of variance
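A short sketch of the population variance computed as the mean of the squares minus the square of the mean, with the standard deviation as its square root. The function names and sample values are illustrative assumptions. For floating-point data with a large mean this shortcut can lose precision, so a library routine such as `statistics.pvariance` may be preferable in practice.

```python
import math
import statistics


def variance(values):
    """Population variance: mean of squares minus squared mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(x * x for x in values) / n - mean ** 2


def std_dev(values):
    """Standard deviation: square root of the variance."""
    return math.sqrt(variance(values))


# Hypothetical sample.
data = [2, 4, 4, 4, 5, 5, 7, 9]

var = variance(data)
sd = std_dev(data)
reference = statistics.pvariance(data)  # cross-check with the standard library
```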
univariate
x: percentage
y: value
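A sketch of how the points of such a percentage-versus-value plot could be computed for one variable. It assumes the common convention that the i-th smallest of N values is paired with the fraction (i - 0.5) / N; other conventions exist, and the function name and data are hypothetical.

```python
def quantile_plot_points(values):
    """Pair each sorted value with the fraction of data at or below it.

    Assumes the convention f_i = (i - 0.5) / N for the i-th smallest value.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# Hypothetical sample; x is the fraction, y is the value.
points = quantile_plot_points([30, 10, 20, 40])
```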
bivariate
Bar chart: categorical
Histogram: numerical
Not suitable for high-dimensional data
Space-filling curve
Hilbert, Gray code, Z-order
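To illustrate how a space-filling curve turns two-dimensional cells into a one-dimensional order, here is a minimal sketch of the Z-order (Morton) curve, which interleaves the bits of the two coordinates. The function name, the bit width and the choice to put x in the even bit positions are assumptions for this example.

```python
def z_order_index(x, y, bits=16):
    """Interleave the bits of x and y (x in even positions, y in odd)."""
    index = 0
    for i in range(bits):
        index |= ((x >> i) & 1) << (2 * i)
        index |= ((y >> i) & 1) << (2 * i + 1)
    return index


# Order the cells of a 2 x 2 grid along the curve.
cells = [(x, y) for y in range(2) for x in range(2)]
order = sorted(cells, key=lambda c: z_order_index(*c))
```

Sorting by the interleaved index visits the cells in the characteristic "Z" shape, so cells that are close on the curve tend to be close in the grid.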
Circle segment technique
TODO
Scatter-plot matrix: one triangular half is enough, since the matrix is symmetric
Parallel coordinates
TODO
Too weird
Chernoff faces
Stick figure
Except where otherwise noted, this site is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License