
Bug 16567: Claims about computational complexity of findInterval #44

Open · shannonpileggi opened this issue Aug 9, 2024 · 4 comments
Assignee: wlandau
Labels: Documentation (Issues in the documentation), Hutch 2024 (Issues reserved for R Dev Day @ Hutch 2024), needs patch (Implement the agreed fix and prepare a patch for review)

shannonpileggi commented Aug 9, 2024

Initial submission below; see Bugzilla for further discussion.

Participants should confirm whether MM's 2015 comment, "The plan is for R to change here and 'automagically' become fast for checks such as is.unsorted() or is.na(.) --- by memoizing such properties in the corresponding object.", was ever implemented.

The documentation of findInterval states:

"... the internal algorithm uses interval search ensuring O(n * log(N)) complexity where ‘n <- length(x)’ (and ‘N <- length(vec)’). For (almost) sorted ‘x’, it will be even faster, basically O(n)."

In fact, the algorithm used for the interval search is O(N + n * log(N)), because of initial checks of vec for sorted order and NAs. This was noted some time ago:

https://stat.ethz.ch/pipermail/r-help/2008-September/174584.html

Note that even if one interprets "internal" as referring only to the internal call down to compiled code, the sentence is still incorrect, because both anyNA() and is.unsorted() also result in internal calls.
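
One quick way to see the cost of these pre-checks (a sketch, assuming the R 4.4.x behavior described above; timings are machine-dependent): with a single query point, the binary search itself is essentially free, yet the call still scales roughly linearly in N because anyNA() and is.unsorted() scan all of vec.

# n = 1, so the binary search alone is O(log(N)),
# but the pre-checks on 'vec' touch all N elements.
vec_small <- sort(runif(1e6))
vec_large <- sort(runif(1e8))
system.time(findInterval(0.5, vec_small))
system.time(findInterval(0.5, vec_large))  # ~100x the elements, roughly ~100x the time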


------------------------------------

The ideal resolution to this (IMO) would be a parameter option to remove the checks. There are many HPC use cases where sorted vectors are maintained and fast binary search is desired.
wlandau (Member) commented Aug 12, 2024

I am not sure I will be able to grasp the internals during R Dev Day, but I would be happy to run numerical experiments to measure the complexity of findInterval(), then propose a documentation patch based on what I find. Please let me know if this is of interest.

mmaechler commented
As of a few minutes ago, R-devel contains a version of findInterval() with two new options, checkSorted = TRUE and checkNA = TRUE. If you set one or both of these to FALSE, you can indeed speed up computations (but you get wrong results if you are "lying"). checkNA = FALSE may actually work on some platforms; it did for me on (Fedora 40) Linux with the default gcc compiler and glibc libraries.
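
A minimal usage sketch, assuming a build of R-devel that includes the two arguments described above (names and defaults as stated; timings illustrative only):

vec <- sort(runif(1e7))  # caller guarantees: sorted, no NAs
x <- runif(1e6)
# Default: the O(N) pre-checks on 'vec' run before the search.
system.time(findInterval(x, vec))
# Skip the pre-checks; correct only because the guarantee above actually holds.
system.time(findInterval(x, vec, checkSorted = FALSE, checkNA = FALSE))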

wlandau (Member) commented Aug 15, 2024

For R Dev Day, I learned the basics of findInterval()'s algorithm with the help of @lawremi and Robert Gentleman, and I ran benchmarks to examine complexity. My results are for R 4.4.0. Here is the simulation study:

metrics <- NULL
n_data <- 1e8L
data <- runif(n_data)
data_increasing <- sort(data, decreasing = FALSE)
data_decreasing <- sort(data, decreasing = TRUE)
range <- range(data)
# Hold length(x) fixed at n_data and vary the number of breakpoints (N).
for (n_breaks in c(10 ^ seq_len(5))) {
  for (rep in seq_len(10)) {
    # Breakpoints extend slightly beyond the data range so every point falls in a bin.
    breaks <- seq(from = min(range) - 1, to = max(range) + 1, length.out = n_breaks)
    # Time findInterval() with x sorted increasing, sorted decreasing, and unsorted.
    seconds_increasing <- system.time(findInterval(x = data_increasing, vec = breaks))["sys.self"]
    seconds_decreasing <- system.time(findInterval(x = data_decreasing, vec = breaks))["sys.self"]
    seconds_unsorted <- system.time(findInterval(x = data, vec = breaks))["sys.self"]
    new_metrics <- data.frame(
      n_breaks = n_breaks,
      seconds = c(seconds_increasing, seconds_decreasing, seconds_unsorted),
      scenario = c("increasing", "decreasing", "unsorted"),
      rep = rep
    )
    print(new_metrics)
    metrics <- rbind(metrics, new_metrics)
  }
}

Complexity is mostly as expected: if we fix the length of x and vary the length of vec, we see O(log(N)) scaling when x is unsorted and O(1) scaling (in N) when x is sorted in increasing order.

library(ggplot2)
library(dplyr)  # for filter()
# Drop the first replication of each setting as a warm-up run.
ggplot(
  data = filter(metrics, rep > 1L),
  mapping = aes(x = log(n_breaks), y = seconds, color = scenario)
) +
  scale_y_log10() +
  geom_point(position = position_dodge(width = 0.5)) +
  theme_gray(20)

[Plot: seconds (log scale) versus log(n_breaks), colored by scenario]
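
One way to quantify the scaling numerically (a sketch reusing the metrics collected above): under O(log(N)) behavior, seconds for the unsorted scenario should be roughly linear in log(n_breaks).

# Fit seconds as a linear function of log(n_breaks) for unsorted x.
unsorted <- filter(metrics, scenario == "unsorted", rep > 1L)
summary(lm(seconds ~ log(n_breaks), data = unsorted))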

There is a slight performance penalty when x is sorted in decreasing order. Robert hypothesized that a few points force the interval search to try a large number of bins before reaching the correct answer, but the resulting bin then serves as the correct initial guess for the next run of points.

# At the largest n_breaks, compare scenarios within each replication.
ggplot(
  data = filter(metrics, n_breaks == max(n_breaks)),
  mapping = aes(x = scenario, y = seconds, group = rep)
) +
  scale_y_log10() +
  geom_point() +
  geom_line() +
  theme_gray(20)

[Plot: seconds (log scale) by scenario, with lines connecting replications]

findInterval() times in general seem to be faster than the default sort method. Radix sort, of course, may be faster; a quick radix sketch follows the histograms below.

# Benchmark the default sort method for comparison with findInterval().
seconds_sort_increasing <- replicate(
  10,
  system.time(sort(data, decreasing = FALSE))["sys.self"]
)
hist(seconds_sort_increasing)

[Histogram: seconds_sort_increasing]

seconds_sort_decreasing <- replicate(
  10,
  system.time(sort(data, decreasing = TRUE))["sys.self"]
)
hist(seconds_sort_decreasing)

[Histogram: seconds_sort_decreasing]
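
To test the radix claim mentioned above (a sketch, not part of the original benchmarks; sort()'s method = "radix" argument is standard base R):

# Request the radix method explicitly and compare to the default-method timings above.
seconds_sort_radix <- replicate(
  10,
  system.time(sort(data, method = "radix"))["sys.self"]
)
hist(seconds_sort_radix)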

wlandau (Member) commented Aug 15, 2024

I posted a comment at https://bugs.r-project.org/show_bug.cgi?id=16567 linking to #44 (comment).
