Title: | Maximum Homogeneity Clustering for Univariate Data |
---|---|
Description: | Maximum homogeneity clustering algorithm for one-dimensional data described in W. D. Fisher (1958) <doi:10.1080/01621459.1958.10501479> via dynamic programming. |
Authors: | Nan Xiao [aut, cre] |
Maintainer: | Nan Xiao <[email protected]> |
License: | GPL-3 |
Version: | 0.3.0 |
Built: | 2024-11-06 04:26:52 UTC |
Source: | https://github.com/nanxstats/oneclust |
Masataka Okabe and Kei Ito's Color Universal Design palette
cud(x, shift = TRUE, reverse = FALSE)
x | Vector, color index. |
shift | Start from the second color in the CUD palette? |
reverse | Reverse the color order? |
A vector of color hex values.
barplot(rep(1, 7), col = cud(1:7))
barplot(rep(1, 8), col = cud(1:8, shift = FALSE))
barplot(rep(1, 8), col = cud(1:8, shift = FALSE, reverse = TRUE))
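To make the `shift` and `reverse` arguments concrete, here is a minimal base R sketch of how such a palette lookup could behave. The function name `cud_sketch` is hypothetical, and the order in which `shift` and `reverse` are applied is an assumption; only the eight Okabe-Ito hex values are standard.

```r
# Hypothetical re-implementation for illustration; not the package's code.
cud_sketch <- function(x, shift = TRUE, reverse = FALSE) {
  # The eight Okabe-Ito colors: black, orange, sky blue, bluish green,
  # yellow, blue, vermillion, reddish purple.
  pal <- c(
    "#000000", "#E69F00", "#56B4E9", "#009E73",
    "#F0E442", "#0072B2", "#D55E00", "#CC79A7"
  )
  if (shift) pal <- pal[-1]     # assumption: shifting drops the leading black
  if (reverse) pal <- rev(pal)  # assumption: reversal happens after shifting
  pal[x]
}

cud_sketch(1:3)
cud_sketch(1:8, shift = FALSE, reverse = TRUE)
```

Under these assumptions, `shift = TRUE` leaves seven usable colors, which is consistent with the first example above indexing `1:7` and the others indexing `1:8`.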
Maximum homogeneity clustering for one-dimensional data
oneclust(x, k, w = NULL, sort = TRUE)
x | Numeric vector, samples to be clustered. |
k | Integer, number of clusters. |
w | Numeric vector, sample weights (optional). The weights should be sampling weights (for example, a certain proportion of the population), not frequency weights (for example, number of occurrences). |
sort | Should we sort x (and w) before clustering? |
A list containing:
cluster - cluster ID of each sample.
cut - index of the optimal cut points.
Fisher, Walter D. 1958. On Grouping for Maximum Homogeneity. Journal of the American Statistical Association 53 (284): 789–98.
set.seed(42)
x <- sample(c(
  rnorm(50, sd = 0.2),
  rnorm(50, mean = 1, sd = 0.3),
  rnorm(100, mean = -1, sd = 0.25)
))
oneclust(x, 3)
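The dynamic programming idea behind Fisher's grouping can be sketched in base R. This is a minimal, unweighted illustration under stated assumptions, not the package's implementation: `fisher_dp_sketch` is a hypothetical name, sample weights are ignored, homogeneity is measured by the within-cluster sum of squares, and reporting `cut` as the start index of each cluster in sorted order is an assumed convention.

```r
# Hypothetical sketch of Fisher-style optimal 1-D grouping, O(k * n^2).
fisher_dp_sketch <- function(x, k) {
  n <- length(x)
  stopifnot(is.numeric(x), k >= 1, k <= n)
  ord <- order(x)
  xs <- x[ord]
  # Prefix sums give the sum of squared deviations of xs[i..j] in O(1).
  s1 <- c(0, cumsum(xs))
  s2 <- c(0, cumsum(xs^2))
  ssq <- function(i, j) {
    s <- s1[j + 1] - s1[i]
    max(0, (s2[j + 1] - s2[i]) - s * s / (j - i + 1))
  }
  # cost[m, j]: minimal total SSQ when xs[1..j] is split into m groups.
  # start[m, j]: start index of the last group in that optimal split.
  cost <- matrix(Inf, nrow = k, ncol = n)
  start <- matrix(1L, nrow = k, ncol = n)
  for (j in seq_len(n)) cost[1, j] <- ssq(1, j)
  if (k >= 2) {
    for (m in 2:k) {
      for (j in m:n) {
        for (i in m:j) {  # candidate last group is xs[i..j]
          cand <- cost[m - 1, i - 1] + ssq(i, j)
          if (cand < cost[m, j]) {
            cost[m, j] <- cand
            start[m, j] <- i
          }
        }
      }
    }
  }
  # Backtrack from the full problem to recover each group's start index.
  cut <- integer(k)
  j <- n
  for (m in k:1) {
    cut[m] <- start[m, j]
    j <- cut[m] - 1L
  }
  cluster <- integer(n)
  cluster[ord] <- findInterval(seq_len(n), cut)  # map back to input order
  list(cluster = cluster, cut = cut)
}

x_toy <- c(1.0, 1.1, 0.9, 5.0, 5.2, 4.9, 10.0, 10.1)
res <- fisher_dp_sketch(x_toy, 3)
res$cluster
res$cut
```

Because the optimal groups are contiguous once the data are sorted, the search reduces to choosing k - 1 break positions, which is what makes the exact solution tractable.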
Simulate the levels and their sizes in a high-cardinality feature
sim_postcode_levels(nlevels = 100L, seed = 1001)
nlevels | Number of levels to generate. |
seed | Random seed. |
A data frame of postal codes and sizes.
The code is derived from the example described in the "rare levels" vignette in the vtreat package.
df_levels <- sim_postcode_levels(nlevels = 500, seed = 42)
head(df_levels)
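For readers who want to see what simulating levels with skewed sizes might look like, here is a base R sketch. Everything in it is an assumption made for illustration: the name `sim_levels_sketch`, the column names `postcode` and `size`, the log-normal size distribution and its parameters, and the `z` plus five digits code format. It is not the package's code.

```r
# Hypothetical sketch; the distribution and column names are assumptions.
sim_levels_sketch <- function(nlevels = 100L, seed = 1001) {
  set.seed(seed)
  # Heavy-tailed level sizes: a few large levels and many small ones.
  size <- round(rlnorm(nlevels, meanlog = log(4000), sdlog = 1))
  # Distinct numeric codes, zero-padded to five digits.
  code <- sample.int(99999L, size = nlevels, replace = FALSE)
  data.frame(
    postcode = paste0("z", formatC(code, width = 5, flag = "0")),
    size = size,
    stringsAsFactors = FALSE
  )
}

lv <- sim_levels_sketch(nlevels = 50, seed = 42)
head(lv)
```

A heavy-tailed size distribution is what produces the mix of rare and common levels that makes a high-cardinality feature hard to encode.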
Simulate a high-cardinality feature and a binary response
sim_postcode_samples(
  df_levels,
  n = 2000L,
  threshold = 1000,
  prob = c(0.3, 0.1),
  seed = 1001
)
df_levels | Data frame of levels and their sizes, such as the output of sim_postcode_levels(). |
n | Number of samples. |
threshold | The threshold for determining if a postal code is rare. |
prob | Occurrence probability vector of the class 1 event in rare and non-rare postal codes. |
seed | Random seed. |
A data frame of samples with postal codes, response labels, and level rarity status.
The code is derived from the example described in the "rare levels" vignette in the vtreat package.
df_levels <- sim_postcode_levels(nlevels = 500, seed = 42)
df_postcode <- sim_postcode_samples(
  df_levels,
  n = 10000,
  threshold = 3000,
  prob = c(0.2, 0.1),
  seed = 43
)
head(df_postcode)
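The roles of `threshold` and `prob` can be illustrated with a small base R sketch. It assumes that a level counts as rare when its size is strictly below `threshold`, that samples are drawn in proportion to level size, and that the output columns are named `postcode`, `label`, and `is_rare`. The function name `sim_samples_sketch` and the toy input are hypothetical; the package's actual rules may differ.

```r
# Hypothetical sketch; the rarity rule and sampling scheme are assumptions.
sim_samples_sketch <- function(lv, n, threshold, prob, seed) {
  set.seed(seed)
  # Draw level indices with probability proportional to level size.
  idx <- sample.int(nrow(lv), size = n, replace = TRUE, prob = lv$size)
  is_rare <- lv$size[idx] < threshold
  # prob[1] applies to rare levels, prob[2] to non-rare levels.
  p <- ifelse(is_rare, prob[1], prob[2])
  data.frame(
    postcode = lv$postcode[idx],
    label = rbinom(n, size = 1, prob = p),
    is_rare = is_rare,
    stringsAsFactors = FALSE
  )
}

lv_toy <- data.frame(
  postcode = c("z00001", "z00002", "z00003", "z00004"),
  size = c(500, 800, 5000, 12000),
  stringsAsFactors = FALSE
)
df_toy <- sim_samples_sketch(
  lv_toy, n = 1000, threshold = 1000, prob = c(0.3, 0.1), seed = 1
)
head(df_toy)
```

With this setup the response rate depends only on whether a level is rare, so the informative signal sits in level size rather than in any individual postal code.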