-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 16567: Claims about computational complexity of findInterval #44
Comments
I am not sure I will be able to grasp the internals during R Dev Day, but I would be happy to run numerical experiments to measure the complexity of |
R-devel for a few minutes now contains a version of |
For R Dev Day, I learned the basics of metrics <- NULL
n_data <- 1e8L
data <- runif(n_data)
data_increasing <- sort(data, decreasing = FALSE)
data_decreasing <- sort(data, decreasing = TRUE)
range <- range(data)
for (n_breaks in c(10 ^ seq_len(5))) {
for (rep in seq_len(10)) {
breaks <- seq(from = min(range) - 1, to = max(range) + 1, length.out = n_breaks)
seconds_increasing <- system.time(findInterval(x = data_increasing, vec = breaks))["sys.self"]
seconds_decreasing <- system.time(findInterval(x = data_decreasing, vec = breaks))["sys.self"]
seconds_unsorted <- system.time(findInterval(x = data, vec = breaks))["sys.self"]
new_metrics <- data.frame(
n_breaks = n_breaks,
seconds = c(seconds_increasing, seconds_decreasing, seconds_unsorted),
scenario = c("increasing", "decreasing", "unsorted"),
rep = paste(n_breaks, rep)
)
print(new_metrics)
metrics <- rbind(metrics, new_metrics)
}
} Complexity is mostly as expected: if we fix the length of ggplot(
data = filter(metrics, rep > 1L),
mapping = aes(x = log(n_breaks), y = seconds, color = scenario)
) +
scale_y_log10() +
geom_point(position = position_dodge(width = 0.5)) +
theme_gray(20) There is a slight performance penalty when the initial sort is in decreasing order. Robert hypothesized that there are a few points where the interval search tries a large number of bins before it gets to the correct answer, but then the new choice of bin is the correct initial guess for the next set of points. ggplot(
data = filter(metrics, n_breaks == max(n_breaks)),
mapping = aes(x = scenario, y = seconds, group = rep)
) +
scale_y_log10() +
geom_point() +
geom_line() +
theme_gray(20)
seconds_sort_increasing <- replicate(
10,
system.time(sort(data, decreasing = FALSE))["sys.self"]
)
hist(seconds_sort_increasing) seconds_sort_decreasing <- replicate(
10,
system.time(sort(data, decreasing = TRUE))["sys.self"]
)
hist(seconds_sort_decreasing) |
I posted a comment at https://bugs.r-project.org/show_bug.cgi?id=16567 linking to #44 (comment). |
Bug 16567: Claims about computational complexity of findInterval
Initial submission below, see bugzilla for further discussion.
Participants should confirm if MM's 2015 comment about "The plan is for R to change here and "automagically" become fast for checks such as is.unsorted() or is.na(.) --- by memoizing such properties in the corresponding object." ever happened.
The text was updated successfully, but these errors were encountered: