Package 'oneclust'

Title: Maximum Homogeneity Clustering for Univariate Data
Description: Implements the maximum homogeneity clustering algorithm for one-dimensional data described in W. D. Fisher (1958) <doi:10.1080/01621459.1958.10501479>, using dynamic programming.
Authors: Nan Xiao [aut, cre]
Maintainer: Nan Xiao <[email protected]>
License: GPL-3
Version: 0.3.0
Built: 2024-09-07 04:35:35 UTC
Source: https://github.com/nanxstats/oneclust

Masataka Okabe and Kei Ito's Color Universal Design palette

Description

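Returns colors from Masataka Okabe and Kei Ito's Color Universal Design (CUD) palette, a color set chosen to stay distinguishable for viewers with color vision deficiency.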

Usage

cud(x, shift = TRUE, reverse = FALSE)

Arguments

x

Vector of color indices into the palette.

shift

Logical. Start from the second color in the CUD palette? Default is TRUE.

reverse

Logical. Reverse the color order? Default is FALSE.

Value

A vector of color hex values.

Examples

barplot(rep(1, 7), col = cud(1:7))
barplot(rep(1, 8), col = cud(1:8, shift = FALSE))
barplot(rep(1, 8), col = cud(1:8, shift = FALSE, reverse = TRUE))
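
For orientation, base R (version 4.0.0 or later) ships an Okabe-Ito palette of its own through grDevices::palette.colors(). The sketch below uses only that base function to show what the shift and reverse options roughly correspond to. Whether cud() returns exactly the same hex values in the same order is an assumption, not something stated in this manual.

```r
# Base R's Okabe-Ito palette; unname() drops color names if present.
okabe_ito <- unname(palette.colors(8, palette = "Okabe-Ito"))

# Full eight-color set, which in base R starts with black.
okabe_ito

# Skipping the first color is the idea behind shift = TRUE
# (assumed analogy, not verified against cud()).
shifted <- okabe_ito[-1]

# Reversed order is the idea behind reverse = TRUE.
reversed <- rev(okabe_ito)

length(shifted)
reversed[8]
```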

Maximum homogeneity clustering for one-dimensional data

Description

Clusters one-dimensional data with the maximum homogeneity algorithm of Fisher (1958), solved via dynamic programming.

Usage

oneclust(x, k, w = NULL, sort = TRUE)

Arguments

x

Numeric vector, samples to be clustered.

k

Integer, number of clusters.

w

Numeric vector, sample weights (optional). Note that the weights here should be sampling weights (for example, a certain proportion of the population), not frequency weights (for example, number of occurrences).

sort

Logical. Sort x (and w) before clustering? Default is TRUE. If FALSE, the data are clustered in the order given.
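
The description of w gives "a certain proportion of the population" as an example of a sampling weight. A minimal base R sketch of building such a vector follows. The stratum sizes are made-up numbers, and the assumption here is that each sample stands for one stratum whose population size is known.

```r
# Hypothetical population size of the stratum each sample represents.
stratum_size <- c(12000, 3000, 45000, 40000)

# Share of the total population covered by each sample.
w <- stratum_size / sum(stratum_size)

w
sum(w)

# The vector could then be passed as the w argument, for example:
# oneclust(x, k, w = w)
```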

Value

A list containing:

  • cluster - cluster id of each sample.

  • cut - index of the optimal cut points.

References

Fisher, Walter D. 1958. On Grouping for Maximum Homogeneity. Journal of the American Statistical Association 53 (284): 789–98. <doi:10.1080/01621459.1958.10501479>

Examples

set.seed(42)
x <- sample(c(
  rnorm(50, sd = 0.2),
  rnorm(50, mean = 1, sd = 0.3),
  rnorm(100, mean = -1, sd = 0.25)
))
oneclust(x, 3)
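
The package description attributes the method to Fisher (1958), solved via dynamic programming. As a rough illustration of that idea, and not the package's actual implementation, the base R sketch below splits an already ordered vector into k contiguous segments that minimize the total within-segment sum of squares. The function name fisher_dp, the extra cost element, and the convention that cut holds the first index of each segment are assumptions made for this sketch. Weights are ignored.

```r
# Hypothetical helper, not part of oneclust.
fisher_dp <- function(x, k) {
  n <- length(x)
  stopifnot(k >= 1, k <= n)

  # Prefix sums give the sum of squared deviations of x[i..j] in O(1).
  s1 <- c(0, cumsum(x))
  s2 <- c(0, cumsum(x^2))
  ssq <- function(i, j) {
    (s2[j + 1] - s2[i]) - (s1[j + 1] - s1[i])^2 / (j - i + 1)
  }

  # cost[j, g]: minimal total SSQ when x[1..j] is split into g segments.
  # start[j, g]: first index of the last segment in that optimal split.
  cost <- matrix(Inf, n, k)
  start <- matrix(1L, n, k)
  for (j in seq_len(n)) cost[j, 1] <- ssq(1, j)

  if (k > 1) {
    for (g in 2:k) {
      for (j in g:n) {
        # Last segment is x[i..j]; the first g - 1 segments cover x[1..(i - 1)].
        for (i in g:j) {
          cand <- cost[i - 1, g - 1] + ssq(i, j)
          if (cand < cost[j, g]) {
            cost[j, g] <- cand
            start[j, g] <- i
          }
        }
      }
    }
  }

  # Walk back from the last segment to recover every segment start.
  cuts <- integer(k)
  j <- n
  for (g in k:1) {
    cuts[g] <- start[j, g]
    j <- cuts[g] - 1L
  }

  cluster <- rep(seq_len(k), times = diff(c(cuts, n + 1L)))
  list(cluster = cluster, cut = cuts, cost = cost[n, k])
}

# Toy data, sorted first, which mirrors what sort = TRUE is described as doing.
toy <- sort(c(1.0, 0.1, 5.2, 0.2, 1.1, 5.0, 0.15))
res <- fisher_dp(toy, 3)
res$cluster
res$cut
```

The triple loop costs on the order of k * n^2 operations, which is fine for illustration but is not tuned for long vectors.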

Simulate the levels and their sizes in a high-cardinality feature

Description

Simulates the levels of a high-cardinality feature (postal codes) together with the size of each level.

Usage

sim_postcode_levels(nlevels = 100L, seed = 1001)

Arguments

nlevels

Number of levels to generate.

seed

Random seed.

Value

A data frame of postal codes and sizes.

Note

The code is derived from the example described in the "rare levels" vignette in the vtreat package.

Examples

df_levels <- sim_postcode_levels(nlevels = 500, seed = 42)
head(df_levels)

Simulate a high-cardinality feature and a binary response

Description

Simulate a high-cardinality feature and a binary response

Usage

sim_postcode_samples(
  df_levels,
  n = 2000L,
  threshold = 1000,
  prob = c(0.3, 0.1),
  seed = 1001
)

Arguments

df_levels

Data frame of postal code levels and their sizes, such as the output of sim_postcode_levels().

n

Number of samples.

threshold

The threshold for determining if a postal code is rare.

prob

Numeric vector of length 2. Probability of the class 1 event in rare and in non-rare postal codes, in that order.

seed

Random seed.

Value

A data frame of samples with postal codes, response labels, and level rarity status.

Note

The code is derived from the example described in the "rare levels" vignette in the vtreat package.

Examples

df_levels <- sim_postcode_levels(nlevels = 500, seed = 42)
df_postcode <- sim_postcode_samples(
  df_levels,
  n = 10000, threshold = 3000, prob = c(0.2, 0.1), seed = 43
)
head(df_postcode)
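
To make the roles of threshold and prob more concrete, the base R sketch below imitates the kind of simulation described above without calling the package. It is a guess at the mechanism, not the package's code. The column names postcode, size, is_rare, and label, the rule that a level is rare when its size falls below the threshold, and the sampling of levels in proportion to size are all assumptions.

```r
set.seed(1)

# Hypothetical level table with assumed column names.
levels_df <- data.frame(
  postcode = sprintf("z%03d", 1:50),
  size = round(rlnorm(50, meanlog = 7, sdlog = 1))
)

threshold <- 1000
prob <- c(0.3, 0.1)  # P(class 1) for rare and non-rare levels
n <- 2000

# Assumption: a level counts as rare when its size is below the threshold.
levels_df$is_rare <- levels_df$size < threshold

# Draw n samples, picking levels in proportion to their sizes.
idx <- sample(nrow(levels_df), n, replace = TRUE, prob = levels_df$size)
sim <- levels_df[idx, c("postcode", "is_rare")]
rownames(sim) <- NULL

# Bernoulli response using the rare or non-rare event probability.
p <- ifelse(sim$is_rare, prob[1], prob[2])
sim$label <- rbinom(n, size = 1, prob = p)

# Empirical event rate by rarity status.
tapply(sim$label, sim$is_rare, mean)
head(sim)
```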