
Bug 16567: Claims about computational complexity of findInterval #44

Open · shannonpileggi opened this issue Aug 9, 2024 · 4 comments
Assignee: wlandau
Labels: Documentation (Issues in the documentation), Hutch 2024 (Issues reserved for R Dev Day @ Hutch 2024), needs patch (Implement the agreed fix and prepare a patch for review)

shannonpileggi commented Aug 9, 2024

Initial submission below; see Bugzilla for further discussion.

Participants should confirm whether MM's 2015 comment, "The plan is for R to change here and 'automagically' become fast for checks such as is.unsorted() or is.na(.) --- by memoizing such properties in the corresponding object.", was ever implemented.

The documentation of findInterval states:

"... the internal algorithm uses interval search ensuring O(n * log(N)) complexity where ‘n <- length(x)’ (and ‘N <- length(vec)’). For (almost) sorted ‘x’, it will be even faster, basically O(n)."

In fact, the algorithm used for the interval search is O(N + n * log(N)), because of initial checks of vec for sorted order and NAs. This was noted some time ago:

https://stat.ethz.ch/pipermail/r-help/2008-September/174584.html

Note that even if one interprets "internal" as referring only to the internal call down to compiled code, the sentence is still incorrect, because both anyNA() and is.unsorted() also result in internal calls.
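
One quick way to see the cost of these pre-checks (a sketch, assuming the R 4.4.x behavior described above; timings are machine-dependent): with a single query point, the binary search itself is essentially free, yet the call still scales roughly linearly in N because anyNA() and is.unsorted() scan all of vec.

# n = 1, so the binary search alone is O(log(N)),
# but the pre-checks on 'vec' touch all N elements.
vec_small <- sort(runif(1e6))
vec_large <- sort(runif(1e8))
system.time(findInterval(0.5, vec_small))
system.time(findInterval(0.5, vec_large))  # ~100x the elements, roughly ~100x the time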


------------------------------------

The ideal resolution to this (IMO) would be a parameter option to remove the checks. There are many HPC use cases where sorted vectors are maintained and fast binary search is desired.
wlandau (Member) commented Aug 12, 2024

I am not sure I will be able to grasp the internals during R Dev Day, but I would be happy to run numerical experiments to measure the complexity of findInterval(), then propose a documentation patch based on what I find. Please let me know if this is of interest.

mmaechler commented
As of a few minutes ago, R-devel contains a version of findInterval() with two new options, checkSorted = TRUE and checkNA = TRUE. If you set one or both of these to FALSE, you can indeed speed up computations (but you get wrong results if you are "lying"). checkNA = FALSE may actually work on some platforms; it did for me on (Fedora 40) Linux with the default gcc compiler and glibc libraries.
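
A minimal usage sketch, assuming a build of R-devel that includes the two arguments described above (names and defaults as stated; timings illustrative only):

vec <- sort(runif(1e7))  # caller guarantees: sorted, no NAs
x <- runif(1e6)
# Default: the O(N) pre-checks on 'vec' run before the search.
system.time(findInterval(x, vec))
# Skip the pre-checks; correct only because the guarantee above actually holds.
system.time(findInterval(x, vec, checkSorted = FALSE, checkNA = FALSE))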

wlandau (Member) commented Aug 15, 2024

For R Dev Day, I learned the basics of findInterval()'s algorithm with the help of @lawremi and Robert Gentleman, and I ran benchmarks to examine complexity. My results are for R 4.4.0. Here is the simulation study:

metrics <- NULL
n_data <- 1e8L
data <- runif(n_data)
data_increasing <- sort(data, decreasing = FALSE)
data_decreasing <- sort(data, decreasing = TRUE)
range <- range(data)
# Hold length(x) fixed at n_data and vary the number of breakpoints (N).
for (n_breaks in c(10 ^ seq_len(5))) {
  for (rep in seq_len(10)) {
    # Breakpoints extend slightly beyond the data range so every point falls in a bin.
    breaks <- seq(from = min(range) - 1, to = max(range) + 1, length.out = n_breaks)
    # Time findInterval() with x sorted increasing, sorted decreasing, and unsorted.
    seconds_increasing <- system.time(findInterval(x = data_increasing, vec = breaks))["sys.self"]
    seconds_decreasing <- system.time(findInterval(x = data_decreasing, vec = breaks))["sys.self"]
    seconds_unsorted <- system.time(findInterval(x = data, vec = breaks))["sys.self"]
    new_metrics <- data.frame(
      n_breaks = n_breaks,
      seconds = c(seconds_increasing, seconds_decreasing, seconds_unsorted),
      scenario = c("increasing", "decreasing", "unsorted"),
      rep = rep
    )
    print(new_metrics)
    metrics <- rbind(metrics, new_metrics)
  }
}

Complexity is mostly as expected: if we fix the length of x and vary the length of vec, we see O(log(N)) scaling when x is unsorted and O(1) scaling (in N) when x is sorted in increasing order.

library(ggplot2)
library(dplyr)  # for filter()
# Drop the first replication of each setting as a warm-up run.
ggplot(
  data = filter(metrics, rep > 1L),
  mapping = aes(x = log(n_breaks), y = seconds, color = scenario)
) +
  scale_y_log10() +
  geom_point(position = position_dodge(width = 0.5)) +
  theme_gray(20)

[Plot: seconds (log scale) versus log(n_breaks), colored by scenario]
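
One way to quantify the scaling numerically (a sketch reusing the metrics collected above): under O(log(N)) behavior, seconds for the unsorted scenario should be roughly linear in log(n_breaks).

# Fit seconds as a linear function of log(n_breaks) for unsorted x.
unsorted <- filter(metrics, scenario == "unsorted", rep > 1L)
summary(lm(seconds ~ log(n_breaks), data = unsorted))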

There is a slight performance penalty when x is sorted in decreasing order. Robert hypothesized that a few points force the interval search to try a large number of bins before reaching the correct answer, but the resulting bin then serves as the correct initial guess for the next run of points.

# At the largest n_breaks, compare scenarios within each replication.
ggplot(
  data = filter(metrics, n_breaks == max(n_breaks)),
  mapping = aes(x = scenario, y = seconds, group = rep)
) +
  scale_y_log10() +
  geom_point() +
  geom_line() +
  theme_gray(20)

[Plot: seconds (log scale) by scenario, with lines connecting replications]

findInterval() times in general seem to be faster than the default sort method. Radix sort, of course, may be faster; a quick radix sketch follows the histograms below.

# Benchmark the default sort method for comparison with findInterval().
seconds_sort_increasing <- replicate(
  10,
  system.time(sort(data, decreasing = FALSE))["sys.self"]
)
hist(seconds_sort_increasing)

[Histogram: seconds_sort_increasing]

seconds_sort_decreasing <- replicate(
  10,
  system.time(sort(data, decreasing = TRUE))["sys.self"]
)
hist(seconds_sort_decreasing)

[Histogram: seconds_sort_decreasing]
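
To test the radix claim mentioned above (a sketch, not part of the original benchmarks; sort()'s method = "radix" argument is standard base R):

# Request the radix method explicitly and compare to the default-method timings above.
seconds_sort_radix <- replicate(
  10,
  system.time(sort(data, method = "radix"))["sys.self"]
)
hist(seconds_sort_radix)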

wlandau (Member) commented Aug 15, 2024

I posted a comment at https://bugs.r-project.org/show_bug.cgi?id=16567 linking to #44 (comment).
