Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft: Refactor internal workflows #52

Open
wants to merge 23 commits into
base: main
Choose a base branch
from
Open

Conversation

dgkf-roche
Copy link
Collaborator

@dgkf-roche dgkf-roche commented Feb 4, 2025

This is a pretty big redesign of the internals. The design objectives were to:

  1. Unify the "plan" (data.frame) and "design" (task_graph) so that they both leverage a graph structure.
  2. Avoid pre-calculation for data that can be easily re-derived
    • Example: package "alias" and "custom" fields are now redundant with the graph structure. We can determine which is the dev and release versions based on their dependencies, no longer needs to be stored in node.
    • Example: formatting functions used instead of aliases for user-facing communication
  3. Aim to make each task node self-contained so that it can be executed directly
    • Example: instead of deriving lib path locations based on class (task_get_install_lib()), the lib path method is declared when the install task node is created

Use Case

The code that has been refactored is really only for a central, primary use case so far:

run("./pkg")

# or, broken down into its component parts
g <- plan_rev_dep_checks("./pkg")  # declarative "high-level" tasks
g <- task_graph_create(g)  # complete graph with dependent tasks
run(checker$new(g))

Naming conventions changes:

check_design => checker
checks_df, tasks_df => task_graph (plan, but want to consolidate these still)
package_spec => pkg_origin
task_spec => task

Non-goals:

  1. Graph performance. For now I'd prefer to have a focus on data structures that represent the problem best, and we can worry about performance optimizations after we know what will best articulate the problem.

I'll highlight important bits of code in comments

Comment on lines +113 to +132
plan_rev_dep_dev_check <- function(path, revdep, repos) {
origin <- pkg_origin_local(path = path)

tasks <- list(
make_unique_task(seed = "dev", check_task(
origin = pkg_origin_repo(package = revdep, repos = repos),
env = DEFAULT_R_CMD_CHECK_ENVVARS,
args = DEFAULT_R_CMD_CHECK_ARGS,
build_args = DEFAULT_R_CMD_BUILD_ARGS
)),
install_task(origin = origin)
)

planned(sequence_graph(
name = hashes(tasks),
task = tasks,
task_type = lapply(tasks, function(task) class(task)[[1]]),
package = c(revdep, origin$package)
))
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Planning a reverse-dependency check requires just two declarative nodes: running the R CMD check and installing the appropriate version of the dependency (CRAN or local dev version).

When constructing a full set of rev dep checks, we create pairs (release + dev) of these types of sub-graphs and then merge them into a single task graph.

note: I'm trying to simplify this further, for now just focus on the tasks <- list(...) part

Comment on lines +85 to +98
install_task <- function(
origin,
type = getOption("pkgType"),
INSTALL_opts = NULL,
lib = lib_loc_default(),
...
) {
task <- task(origin = origin, ...)
task$type <- type
task$INSTALL_opts <- INSTALL_opts
task$lib <- lib
class(task) <- c("install_task", class(task))
task
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An install_task already contains information about the library it is intended to install into. This could be a file path when one is already known, but is more likely a method of selecting a lib path such as lib_loc_default()

lib_loc_default is simply a structure(list(), class = c("lib_loc_default", "lib_loc")) used for dispatch when deriving libraries.

Comment on lines +34 to +65
# for each check task in the plan, build a dependency tree and merge it
# into the existing check task subtree
plan_neighborhoods <- lapply(plan_neighborhoods, function(nh) {
subtree <- igraph::induced_subgraph(plan, nh)
deps <- dep_tree(nh[[1]]$task)
igraph::reverse_edges(deps)

subtree <- graph_project(
x = deps,
onto = subtree,
where = c("name" = "package")
)

# set missing dependencies to be installed from repo
missing_task <- is.na(igraph::V(subtree)$task)
igraph::V(subtree)$task[missing_task] <- lapply(
igraph::V(subtree)$package[missing_task],
function(package) {
origin <- try_pkg_origin_repo(package = package, repos = repos)
install_task(origin = origin)
}
)

# re-hash tasks as vertex names (populate missing vertex names)
igraph::V(subtree)$name <- vcapply(igraph::V(subtree)$task, hash)
igraph::V(subtree)$task_type <- vcapply(
igraph::V(subtree)$task,
function(task) class(task)[[1]]
)

subtree
})
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a a big component of the new strategy. When we "plan" a set of checks, we don't really care about the details of where packages come from or how dependencies are managed. This function completes the graph with all the task dependencies that need to be performed before the plan can be executed.

To accomplish this, we look at each check task, derive its dependency tree and then project that dependency tree back onto the existing planned tasks.

Let's imagine a situation where we are planning to run a rev dep check for pkgA with a dev version our package, pkgB. In this case, we plan to run R CMD check where our dependency is installed using the local dev source code. When we derive the dependency tree, we want to make sure that this dev dependency continues to use the existing task that has planned its installation.

graph_project projects a set of vertices and edges onto another graph based on one or more attributes. If a vertex already exists, that vertex is retained and all the edges from both graphs are merged.

flowchart TB
    rdcA["revdep check {pkgA}"]
    instB["install dep {pkgB}<br>from source"]

    pkgA["{pkgA}"]
    pkgB["{pkgB}"]
    pkgX["{pkgX}"]
    pkgY["{pkgY}"]
    pkgZ["{pkgZ}"]

    subgraph plan
        direction TB
        rdcA --> instB
    end

    subgraph deps["{pkgA} dependencies"]
        direction TB
        pkgA --> pkgB
        pkgA --> pkgX
        pkgB --> pkgY
        pkgB --> pkgZ
    end
Loading

becomes

flowchart TB
    rdcA["revdep check {pkgA}"]
    instB["install dep {pkgB}<br>from source"]
    instX["install dep {pkgX}<br> from CRAN"]
    instY["install dep {pkgY}<br> from CRAN"]
    instZ["install dep {pkgZ}<br> from CRAN"]

    subgraph task graph
        direction TB
        rdcA --> instB
        rdcA --> instX
        instB --> instY
        instB --> instZ
    end
Loading

}

#' @export
plot.task_graph <- function(x, ...) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some visualization tools to help with debugging. I have aspirations to use the viewer pane to have an actively updating graph as the execution is happening, but nowhere near that feature yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants