# Version solving
Version solving consists of finding a set of packages and versions that satisfy all the constraints of a given project's dependencies. In Rust, the package manager Cargo takes the dependencies specified in the `Cargo.toml` file and deduces the complete list of exact versions of packages needed for your code to run. That includes direct dependencies, but also indirect ones, which are dependencies of your dependencies.
Packages and versions are not restricted to code libraries though. In fact, they could refer to anything, where "packages" act as a general name for things, and "versions" describe the evolution of those things. Such things could be office documents, laws, cooking recipes, etc. Though if you version cooking recipes, well, you are a very organized person.
## Semantic versioning
The most common form of versioning scheme for dependencies is called semantic versioning. The base idea of semantic versioning is to provide versions as triplets of the form `Major.Minor.Patch`, such as `2.4.1`.
The "semantic" term comes from the meaning of each part of the versioning system.
Publishing a new patch version, for example from 2.4.1
to 2.4.2
means
no interface has changed in the provided API of the library.
That may be the case for documentation changes,
or internal performance improvements for example.
Increasing the minor number, for example from `2.4.1` to `2.5.0`, means that things have been added to the API, but nothing was changed in the pre-existing API provided by the library. Finally, increasing the major number, such as from `2.4.1` to `3.0.0`, means that some parts of the API have changed and thus may be incompatible with how we are currently using the previous version.
In Rust, packages are called crates and use semantic versioning. In fact, if you specify a dependency of the form `package = "2.4.1"`, Cargo will interpret that as the version constraint `2.4.1 <= v < 3.0.0` for that package. It does so based on the fact that any version in that range should not break our code, according to the rules of semantic versioning. For more information on dependency specifications in `Cargo.toml`, you can read the Cargo reference book.
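To make this concrete, here is a small sketch expressing that same caret constraint as a version range, using the `Range` type of the pubgrub crate presented later in this guide. The specific version numbers are just for illustration.

```rust
use pubgrub::range::Range;
use pubgrub::version::SemanticVersion;

// `package = "2.4.1"` in Cargo means the constraint 2.4.1 <= v < 3.0.0.
let caret = Range::between(
    SemanticVersion::new(2, 4, 1),
    SemanticVersion::new(3, 0, 0),
);
assert!(caret.contains(&SemanticVersion::new(2, 9, 7)));
assert!(!caret.contains(&SemanticVersion::new(3, 0, 0)));
```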
### Side note on semantic versioning
Some people think that the granularity of semantic versioning is too broad in the case of major version changes. Instead, versions should never be breaking, but use new namespaces for things that change. It brings the same benefits in the large that choosing immutability as a default brings in the small. For more information on this point of view, I highly recommend "Spec-ulation" by Rich Hickey, creator of the Clojure programming language.
# Using the pubgrub crate
The PubGrub algorithm works for any package type implementing `Clone + Eq + Hash + Debug + Display`, and any version type implementing our `Version` trait, defined as follows.
```rust
pub trait Version: Clone + Ord + Debug + Display {
    fn lowest() -> Self;
    fn bump(&self) -> Self;
}
```
The `lowest()` trait method should return the lowest version that exists, and `bump(&self)` should return the smallest version strictly higher than the current one.
For convenience, we already provide the `SemanticVersion` type, which implements `Version` for versions expressed as `Major.Minor.Patch`. We also provide the `NumberVersion` implementation of `Version`, which is basically just a newtype for non-negative integers: 0, 1, 2, etc. Note that the complete semver specification also involves pre-release and metadata tags, which are not handled in our simple `SemanticVersion` type.
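If neither provided type fits your use case, implementing `Version` yourself is short. Below is a minimal sketch for a hypothetical `BuildNumber` version type (a name invented for this example).

```rust
use pubgrub::version::Version;
use std::fmt;

/// Hypothetical version type: a monotonically increasing build number.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct BuildNumber(pub u64);

impl fmt::Display for BuildNumber {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "build-{}", self.0)
    }
}

impl Version for BuildNumber {
    /// The lowest build number that exists.
    fn lowest() -> Self {
        BuildNumber(0)
    }
    /// The smallest build number strictly higher than this one.
    fn bump(&self) -> Self {
        BuildNumber(self.0 + 1)
    }
}
```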
Now that we know the `Package` and `Version` trait requirements, let's explain how to actually use `pubgrub` with a simple example.
## Basic example with OfflineDependencyProvider
Let's imagine that we are building a user interface with a menu containing dropdowns with some icons, icons that we are also directly using in other parts of the interface. For this scenario our direct dependencies are `menu` and `icons`, but the complete set of dependencies looks as follows:
- `user_interface` depends on `menu` and `icons`
- `menu` depends on `dropdown`
- `dropdown` depends on `icons`
- `icons` has no dependency
We can model that scenario as follows.
```rust
use pubgrub::solver::{OfflineDependencyProvider, resolve};
use pubgrub::version::NumberVersion;
use pubgrub::range::Range;

// Initialize a dependency provider.
let mut dependency_provider = OfflineDependencyProvider::<&str, NumberVersion>::new();

// Add all known dependencies.
dependency_provider.add_dependencies(
    "user_interface",
    1,
    [("menu", Range::any()), ("icons", Range::any())],
);
dependency_provider.add_dependencies("menu", 1, [("dropdown", Range::any())]);
dependency_provider.add_dependencies("dropdown", 1, [("icons", Range::any())]);
dependency_provider.add_dependencies("icons", 1, []);

// Run the algorithm.
let solution = resolve(&dependency_provider, "user_interface", 1).unwrap();
```
As we can see in the previous code example, the key function of the PubGrub version solver is `resolve`. It takes as arguments a dependency provider, as well as the package and version for which we want to solve dependencies, here the package `"user_interface"` at version 1.
The dependency provider must be an instance of a type implementing the `DependencyProvider` trait defined in this crate. That trait defines methods that the resolver can call when looking for packages and versions to try in the solver loop. For convenience and for testing purposes, we already provide an implementation of a dependency provider called `OfflineDependencyProvider`. As the name suggests, it doesn't do anything fancy, and you have to pre-register all known dependencies with calls to `add_dependencies(package, version, vec_of_dependencies)` before being able to use it in the `resolve` function.
Dependencies are specified with a `Range`, ranges being version constraints defining sets of versions. In most cases, you would use `Range::between(v1, v2)`, which means any version higher than or equal to `v1` and strictly lower than `v2`. In the previous example, we just used `Range::any()`, which basically means "any version".
## Writing your own dependency provider
The `OfflineDependencyProvider` is very useful for testing and playing with the API, but would not be usable in more complex settings like Cargo, for example. In such cases, a dependency provider may need to retrieve package information from caches, from the disk, or from network requests. Then, you might want to implement `DependencyProvider` for your own type.
The `DependencyProvider` trait is defined as follows.
```rust
/// Trait that allows the algorithm to retrieve available packages and their dependencies.
/// An implementor needs to be supplied to the [resolve] function.
pub trait DependencyProvider<P: Package, V: Version> {
    /// Decision making is the process of choosing the next package
    /// and version that will be appended to the partial solution.
    /// Every time such a decision must be made,
    /// potential valid packages and version ranges are preselected by the resolver,
    /// and the dependency provider must choose.
    ///
    /// Note: the type `T` ensures that this returns an item from the `packages` argument.
    fn choose_package_version<T: Borrow<P>, U: Borrow<Range<V>>>(
        &self,
        potential_packages: impl Iterator<Item = (T, U)>,
    ) -> Result<(T, Option<V>), Box<dyn Error>>;

    /// Retrieves the package dependencies.
    /// Return [Dependencies::Unknown] if its dependencies are unknown.
    fn get_dependencies(
        &self,
        package: &P,
        version: &V,
    ) -> Result<Dependencies<P, V>, Box<dyn Error>>;

    /// This is called fairly regularly during the resolution,
    /// if it returns an Err then resolution will be terminated.
    /// This is helpful if you want to add some form of early termination like a timeout,
    /// or you want to add some form of user feedback if things are taking a while.
    /// If not provided the resolver will run as long as needed.
    fn should_cancel(&self) -> Result<(), Box<dyn Error>> {
        Ok(())
    }
}
```
As you can see, implementing the `DependencyProvider` trait requires you to implement two functions, `choose_package_version` and `get_dependencies`. The first one, `choose_package_version`, is called by the resolver when a new package has to be tried. At that point, the resolver calls `choose_package_version` with a list of potential packages and their associated acceptable version ranges. It's then the role of the dependency provider to pick a package and a suitable version in that range.
The simplest decision strategy would be to just pick the first package and its first compatible version. Provided there exists a method `fn available_versions(package: &P) -> impl Iterator<Item = &V>` for your type, it could be implemented as follows. We discuss more advanced decision-making strategies later.
```rust
fn choose_package_version<T: Borrow<P>, U: Borrow<Range<V>>>(
    &self,
    mut potential_packages: impl Iterator<Item = (T, U)>,
) -> Result<(T, Option<V>), Box<dyn Error>> {
    // Pick the first potential package and its first compatible version.
    let (package, range) = potential_packages.next().unwrap();
    let version = self
        .available_versions(package.borrow())
        .filter(|v| range.borrow().contains(v))
        .next();
    Ok((package, version.cloned()))
}
```
The second required method is the `get_dependencies` method. For a given package version, this method should return the corresponding dependencies. Retrieving those dependencies may fail due to IO or other issues, and in such cases the function should return an error. Even if it does not fail, we want to distinguish the cases where the dependency provider does not know the answer from the cases where the package has no dependencies. For this reason, the return type in case of success is the `Dependencies` enum, defined as follows.
```rust
pub enum Dependencies<P: Package, V: Version> {
    Unknown,
    Known(DependencyConstraints<P, V>),
}

pub type DependencyConstraints<P, V> = Map<P, Range<V>>;
```
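For instance, assuming your type stores known dependencies in a double map (a hypothetical `known_deps: Map<P, Map<V, DependencyConstraints<P, V>>>` field), a sketch of `get_dependencies` could look like this:

```rust
fn get_dependencies(
    &self,
    package: &P,
    version: &V,
) -> Result<Dependencies<P, V>, Box<dyn Error>> {
    match self
        .known_deps
        .get(package)
        .and_then(|versions| versions.get(version))
    {
        // We have never heard of that package version.
        None => Ok(Dependencies::Unknown),
        // We know its dependencies, possibly an empty map.
        Some(deps) => Ok(Dependencies::Known(deps.clone())),
    }
}
```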
Finally, there is an optional `should_cancel` method. As described in its documentation, this method is regularly called in the solver loop, and defaults to doing nothing. But if needed, you can override it to provide custom behavior, such as giving some feedback, or stopping early to prevent a DDoS. Any useful behavior would require mutability of `self`, and that is possible thanks to interior mutability. Read the next section for more info on that!
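As an appetizer, here is a sketch of a `should_cancel` implementing a timeout, assuming our provider stores hypothetical `start: Cell<Option<Instant>>` and `limit: Duration` fields, using a `Cell` precisely because `should_cancel` only gets `&self`:

```rust
fn should_cancel(&self) -> Result<(), Box<dyn Error>> {
    // Record the starting time on the first call.
    let start = self.start.get().unwrap_or_else(|| {
        let now = std::time::Instant::now();
        self.start.set(Some(now));
        now
    });
    if start.elapsed() > self.limit {
        Err("dependency resolution timed out".into())
    } else {
        Ok(())
    }
}
```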
## Caching dependencies in a DependencyProvider
A dependency provider can be reused for multiple resolutions, usually of the same package, and thus asking for the same dependencies. Since dependencies are generally immutable, caching them is a valid strategy to avoid slow operations such as fetching remote data. But caching data in a dependency provider while computing the result of `get_dependencies` would require `&mut self` instead of `&self`. We wanted to encode in the type system that the `resolve` function cannot mutate a dependency provider on its own, and that's why we chose `&self`. But if the dependency provider wants to mutate itself, there is a pattern called interior mutability enabling exactly that.
We will now set up a simple example of how to do that. If by the end you think the trade-off is not worth it and we should use `&mut self` in the method signature, please let us know with an issue in the pubgrub repository.
Let `DependencyCache` be our cache for dependencies, with existing methods to fetch them over the network:
```rust
struct DependencyCache<P: Package, V: Version> {
    cache: Map<P, BTreeMap<V, DependencyConstraints<P, V>>>,
}

impl<P: Package, V: Version> DependencyCache<P, V> {
    /// Initialize cached dependencies.
    pub fn new() -> Self { ... }
    /// Fetch dependencies of a given package on the network and store them in the cache.
    pub fn fetch(&mut self, package: &P, version: &V) -> Result<(), Box<dyn Error>> { ... }
    /// Extract dependencies of a given package from the cache.
    pub fn get(&self, package: &P, version: &V) -> Dependencies<P, V> { ... }
}
```
We can implement the `DependencyProvider` trait in the following way, using `RefCell` for interior mutability in order to cache dependencies.
```rust
pub struct CachingDependencyProvider<P: Package, V: Version> {
    cached_dependencies: RefCell<DependencyCache<P, V>>,
}

impl<P: Package, V: Version> DependencyProvider<P, V> for CachingDependencyProvider<P, V> {
    fn choose_package_version<...>(...) -> ... { ... }

    fn get_dependencies(
        &self,
        package: &P,
        version: &V,
    ) -> Result<Dependencies<P, V>, Box<dyn Error>> {
        // Bind the cache lookup first so that the immutable borrow
        // is released before we borrow mutably below.
        let cached = self.cached_dependencies.borrow().get(package, version);
        match cached {
            deps @ Dependencies::Known(_) => Ok(deps),
            Dependencies::Unknown => {
                self.cached_dependencies.borrow_mut().fetch(package, version)?;
                Ok(self.cached_dependencies.borrow().get(package, version))
            }
        }
    }
}
```
An example of caching based on the `OfflineDependencyProvider` is available in `examples/caching_dependency_provider.rs`.
## Strategical decision making in a DependencyProvider
In PubGrub, decision making is the process of choosing the next package and version that will be appended to the solution being built. Every time such a decision must be made, potential valid packages are preselected with corresponding valid ranges of versions. Then, there is some freedom regarding which of those package versions to choose.
The strategy employed to choose such package and version cannot change whether a solution exists or not, but it can drastically change the performance of the solver, or the properties of the solution. The documentation of Pub (Dart's package manager that uses PubGrub under the hood) states the following:
> Pub chooses the latest matching version of the package with the fewest versions that match the outstanding constraint. This tends to find conflicts earlier if any exist, since these packages will run out of versions to try more quickly. But there's likely room for improvement in these heuristics.
In our implementation of PubGrub, decision making responsibility is divided into two pieces. The resolver takes care of making a preselection of potential packages and corresponding ranges of versions. Then it's the dependency provider that has the freedom of employing whatever strategy it wants to pick a single package and version, within the `choose_package_version` method.
```rust
fn choose_package_version<T: Borrow<P>, U: Borrow<Range<V>>>(
    &self,
    potential_packages: impl Iterator<Item = (T, U)>,
) -> Result<(T, Option<V>), Box<dyn Error>>;
```
### Picking a package
Potential packages basically are packages that appeared in the dependencies of a package we've already added to the solution. So once a package appears in the potential packages, it will continue to be proposed as such until we pick it, or until a conflict shows up and the solution is backtracked before needing it.

Imagine that one such potential package is limited to a range containing no existing version: we are heading directly to a conflict! So we had better deal with that conflict as soon as possible, instead of delaying it for later, since we will have to backtrack anyway. Consequently, we always want to pick first a conflicting potential package with no valid version. Similarly, potential packages with only one valid version give us no choice and limit options going forward, so we might want to pick such potential packages before others with more version choices. Generalizing this strategy to picking the potential package with the lowest number of valid versions is a rather good heuristic performance-wise.
This strategy is the one employed by the `OfflineDependencyProvider`. For convenience, we also provide a helper function `choose_package_with_fewest_versions` directly embedding this strategy. It can be used directly in `choose_package_version` if provided a helper function to retrieve existing versions of a package, `list_available_versions: Fn(&P) -> Iterator<Item = V>`.
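Concretely, assuming our type has an `available_versions(&self, package: &P)` method returning an iterator of versions, `choose_package_version` reduces to a one-liner. This is just a sketch of the wiring:

```rust
fn choose_package_version<T: Borrow<P>, U: Borrow<Range<V>>>(
    &self,
    potential_packages: impl Iterator<Item = (T, U)>,
) -> Result<(T, Option<V>), Box<dyn Error>> {
    Ok(pubgrub::solver::choose_package_with_fewest_versions(
        |p| self.available_versions(p).cloned(),
        potential_packages,
    ))
}
```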
### Picking a version
By default, the version returned by the helper function `choose_package_with_fewest_versions` is the first compatible one in the iterator returned by `list_available_versions` for the chosen package. So you can order the iterator with preferred versions first, and they will be picked by the solver. This makes it very easy to switch between a dependency provider that picks the most recent compatible versions and one that instead chooses the oldest compatible versions. Such behavior may be desirable, for example, to check that the lower bounds of your dependencies still pass the code tests.
In general, letting the dependency provider choose a version in `choose_package_version` provides a great deal of flexibility and enables things like

- choosing the newest versions,
- choosing the oldest versions,
- choosing already downloaded versions,
- choosing versions specified in a lock file,

and many other desirable behaviors for the resolver, controlled directly by the dependency provider, as sketched below.
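As a tiny sketch of the ordering trick mentioned above, assuming the versions of a package are stored sorted in ascending order:

```rust
use pubgrub::version::NumberVersion;

// Versions of some package, stored in ascending order.
let versions = vec![NumberVersion(1), NumberVersion(2), NumberVersion(3)];

// Yield the newest first: the solver will prefer recent compatible versions.
let newest_first = versions.iter().rev().cloned();

// Yield the oldest first: handy to check that dependency lower bounds still work.
let oldest_first = versions.iter().cloned();
```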
# Solution and error reporting
When everything goes well, the algorithm finds and returns a set of packages and versions satisfying all the constraints of direct and indirect dependencies. Sometimes however, there is no solution because dependencies are incompatible. In such cases, the algorithm returns a `PubGrubError::NoSolution(derivation_tree)`, where the provided derivation tree is a custom binary tree containing the chain of reasons why there is no solution.
All the items in the tree are called incompatibilities and may be of two types, either "external" or "derived". Leaves of the tree are external incompatibilities, and nodes are derived. External incompatibilities express facts that are independent of the way this algorithm is implemented, such as:
- dependencies: package "a" at version 1 depends on package "b" at version 4
- missing dependencies: dependencies of package "a" are unknown
- absence of version: there is no version of package "a" higher than version 5
In contrast, derived incompatibilities are obtained during the algorithm execution by deduction, such as if "a" depends on "b" and "b" depends on "c", then "a" depends on "c".
Processing a derivation tree in a custom way to generate a human-friendly failure report is not an easy task. For convenience, this crate provides a `DefaultStringReporter` able to convert a derivation tree into a human-friendly `String` explanation of the failure. You may use it as follows.
```rust
use pubgrub::solver::resolve;
use pubgrub::report::{DefaultStringReporter, Reporter};
use pubgrub::error::PubGrubError;

match resolve(&dependency_provider, root_package, root_version) {
    Ok(solution) => println!("{:?}", solution),
    Err(PubGrubError::NoSolution(mut derivation_tree)) => {
        derivation_tree.collapse_no_versions();
        eprintln!("{}", DefaultStringReporter::report(&derivation_tree));
    }
    Err(err) => panic!("{:?}", err),
};
```
Notice that we also used `collapse_no_versions()` above. This method simplifies the derivation tree to get rid of the `NoVersions` external incompatibilities in the derivation tree. So instead of seeing things like this in the report:
```txt
Because there is no version of foo in 1.0.1 <= v < 2.0.0
and foo 1.0.0 depends on bar 2.0.0 <= v < 3.0.0,
foo 1.0.0 <= v < 2.0.0 depends on bar 2.0.0 <= v < 3.0.0.
...
```
You will directly see something like:
```txt
Because foo 1.0.0 <= v < 2.0.0 depends on bar 2.0.0,
...
```
## Writing your own error reporting logic
The `DerivationTree` is a custom binary tree whose leaves are external incompatibilities, defined as follows,
```rust
pub enum External<P: Package, V: Version> {
    /// Initial incompatibility aiming at picking the root package for the first decision.
    NotRoot(P, V),
    /// No versions from range satisfy given constraints.
    NoVersions(P, Range<V>),
    /// Dependencies of the package are unavailable for versions in that range.
    UnavailableDependencies(P, Range<V>),
    /// Incompatibility coming from the dependencies of a given package.
    FromDependencyOf(P, Range<V>, P, Range<V>),
}
```
and nodes are derived incompatibilities, defined as follows.
```rust
pub struct Derived<P: Package, V: Version> {
    /// Terms of the incompatibility.
    pub terms: Map<P, Term<V>>,
    /// Indicate if that incompatibility is present multiple times
    /// in the derivation tree.
    /// If that is the case, it has a unique id, provided in that option.
    /// Then, we may want to only explain it once,
    /// and refer to the explanation for the other times.
    pub shared_id: Option<usize>,
    /// First cause.
    pub cause1: Box<DerivationTree<P, V>>,
    /// Second cause.
    pub cause2: Box<DerivationTree<P, V>>,
}
```
The `terms` hashmap contains the terms of the derived incompatibility. The rule is that the terms of an incompatibility cannot all be true at the same time. So a dependency can, for example, be expressed with an incompatibility containing a positive term and a negative term. For example, ""root" at version 1 depends on "a" at version 4" can be expressed by the incompatibility `{root: 1, a: not 4}`. A missing version can be expressed by an incompatibility with a single term. So for example, if version 4 of package "a" is missing, it can be expressed with the incompatibility `{a: 4}`, which forbids that version.
If you want to write your own reporting logic, I highly suggest building a good understanding of incompatibilities by first reading the section of this book on the internals of the PubGrub algorithm.
The root of the derivation tree is usually a derived incompatibility containing a single term, such as `{ "root": 1 }` if we were trying to solve dependencies for package "root" at version 1. Imagine we had one simple dependency on "a" at version 4, but somehow that version does not exist. Then version solving would fail and return a derivation tree looking as follows.
```txt
Derived { root: 1 }
├─ External dependency { root: 1, a: not 4 }
└─ External no version { a: 4 }
```
The goal of error reporting is to transform that tree into a friendly human-readable output. It could be something like:
```txt
This project depends on version 4 of package a which does not exist
so version solving failed.
```
One of the subtleties of error reporting is that the binary derivation tree is actually a DAG (directed acyclic graph). Occasionally, an incompatibility may cause multiple derived incompatibilities in the same derivation tree such as below, where the incompatibility 2 is used to derive both incompatibilities 4 and 5.
```txt
                      +----------+       +----------+
             +------->|incompat 4|------>|incompat 1|
             |        +----------+       +----------+
             |             |
+----------+ |             |             +----------+
|incompat 6|-+             +------------>|incompat 2| (shared)
+----------+ |             |             +----------+
   (root)    |             |
             |        +----------+       +----------+
             +------->|incompat 5|------>|incompat 3|
                      +----------+       +----------+
```
For ease of processing, the `DerivationTree` duplicates such nodes in the tree, but their `shared_id` attribute will hold a `Some(id)` variant. In error reporting, you may want to check if you already gave an explanation for a shared derived incompatibility, and in such cases maybe use line references instead of re-explaining the same thing.
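As a starting point, here is a sketch of a custom reporter doing a depth-first traversal of the derivation tree, giving a full explanation of each shared incompatibility only once:

```rust
use std::collections::HashSet;
use pubgrub::package::Package;
use pubgrub::report::DerivationTree;
use pubgrub::version::Version;

fn explain<P: Package, V: Version>(
    tree: &DerivationTree<P, V>,
    explained: &mut HashSet<usize>,
) {
    match tree {
        // Leaves are external facts, and implement Display.
        DerivationTree::External(external) => println!("fact: {}", external),
        DerivationTree::Derived(derived) => {
            if let Some(id) = derived.shared_id {
                if !explained.insert(id) {
                    // Already explained: just refer to the previous explanation.
                    println!("(see incompatibility #{} above)", id);
                    return;
                }
            }
            explain(&derived.cause1, explained);
            explain(&derived.cause2, explained);
            // Terms of an incompatibility cannot all be true at the same time.
            println!("so the following terms are incompatible:");
            for (package, term) in &derived.terms {
                println!("  {}: {}", package, term);
            }
        }
    }
}
```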
# Advanced usage and limitations
By design, the current implementation of PubGrub is rather well suited to handle a dependency system with the following constraints:

1. Packages are uniquely identified.
2. Versions are in a discrete set, with a total order.
3. The successor of a given version is always uniquely defined.
4. Dependencies of a package version are fixed.
5. Exactly one version must be selected per package depended on.
The fact that packages are uniquely identified (1) is perhaps the only constraint that makes sense for all common dependency systems. The rest of the constraints, however, are each inadequate for some common real-world dependency systems.
For example, it's possible to have dependency systems where order is not required for versions (2).
In such systems, dependencies must be specified with exact sets of compatible versions, and bounded ranges make no sense.
Being able to uniquely define the successor of any version (3) is also a constraint that is not a natural fit if versions have a system of pre-releases. Indeed, what is the successor of `2.0.0-alpha`? We can't tell if that is `2.0.0`, `2.0.0-beta`, or `2.0.0-whatever`.
The constraint of fixed dependencies (4) is also not followed in programming languages allowing optional dependencies. In Rust packages, optional dependencies are called "features", for example.
Finally, restricting solutions to only one version per package (5) is also too constraining for dependency systems allowing breaking changes.
In cases where packages A and B both depend on different ranges of package C, we sometimes want to be able to have a solution where two versions of C are present, and let the compiler decide if their usages of C in the code are compatible.
In the following subsections, we try to show how we can circumvent those limitations with clever usage of dependency providers.
## Optional dependencies
Sometimes, we want the ability to express that some functionalities are optional. Since those functionalities may depend on other packages, we also want the ability to mark those dependencies optional.
Let's imagine a simple scenario in which we are developing package "A" and depending on package "B", itself depending on "C". Now "B" just added new functionalities, very convenient but also very heavy. For architectural reasons, they decided to release them as optional features of the same package instead of in a new package. To enable those features, one has to activate the "heavy" option of package "B", bringing a new dependency to package "H". We will mark optional features in dependencies with a slash separator "/". So we have now "A" depends on "B/heavy" instead of previously "A" depends on "B". And the complete dependency resolution is thus now ["A", "B/heavy", "C", "H"].
But what happens if our package "A" starts depending on another package "D", which depends on "B" without the "heavy" option? We would now have ["A", "B", "B/heavy", "C", "D", "H"]. Is this problematic? Can we reconcile the fact that both "B" and "B/heavy" are in the dependencies?
### Strictly additive optional dependencies
The most logical solution to this situation is to require that optional features and dependencies are strictly additive. Meaning "B/heavy" is entirely compatible with "B" and only brings new functions and dependencies; "B/heavy" cannot change the dependencies of "B", only add new ones. Once this hypothesis holds for our dependency system, we can model "B/heavy" as a different package entirely, depending on both "B" and "H", leading to the solution ["A", "B", "B/heavy", "C", "D", "H"], as sketched below. Whatever new optional feature gets added to "B" can similarly be modeled by a new package "B/new-feat", also depending on "B" and on its own new dependencies. When dependency resolution ends, we can gather all features of "B" that were added to the solution and compile "B" with those.
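Here is a sketch of that scenario, encoded with the `OfflineDependencyProvider` and everything pinned at version 1 for simplicity: the feature package "B/heavy" is just a regular package depending on both its base package "B" and on "H".

```rust
use pubgrub::solver::{OfflineDependencyProvider, resolve};
use pubgrub::version::NumberVersion;
use pubgrub::range::Range;

let mut dp = OfflineDependencyProvider::<&str, NumberVersion>::new();
// "A" depends on the "heavy" feature of "B", and on "D".
dp.add_dependencies("A", 1, [("B/heavy", Range::any()), ("D", Range::any())]);
// "B/heavy" is strictly additive: it depends on its base package "B", plus "H".
dp.add_dependencies("B/heavy", 1, [("B", Range::any()), ("H", Range::any())]);
dp.add_dependencies("D", 1, [("B", Range::any())]);
dp.add_dependencies("B", 1, [("C", Range::any())]);
dp.add_dependencies("C", 1, []);
dp.add_dependencies("H", 1, []);

// The solution contains "A", "B", "B/heavy", "C", "D" and "H".
let solution = resolve(&dp, "A", 1).unwrap();
```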
### Dealing with versions
In the above example, we elided versions and only talked about packages. Adding versions to the mix actually does not change anything, and solves the optional dependencies problem very elegantly. The key point is that an optional feature package, such as "B/heavy", would depend on its base package, "B", at exactly the same version. So if the "heavy" option of package "B" at version v = 3 brings a dependency on "H" at v >= 2, then we can model the dependencies of "B/heavy" at v = 3 by ["B" @ v = 3, "H" @ v >= 2].
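Continuing the sketch above, the only change when versions enter the picture is that the feature package pins its base package exactly, for instance with the numbers from this paragraph:

```rust
// "B/heavy" at version 3 depends on "B" at exactly version 3,
// and on "H" at any version >= 2.
dp.add_dependencies("B/heavy", 3, [
    ("B", Range::exact(3)),
    ("H", Range::higher_than(2)),
]);
```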
### Example implementation
A complete example implementation is available in the `optional-deps` crate of the `advanced_dependency_providers` repository. Let's give an explanation of that implementation. For the sake of simplicity, we will consider packages of type `String` and versions of type `NumberVersion`, which is just a newtype around `u32` implementing the `Version` trait.
#### Defining an index of packages
We define an `Index`, storing all dependencies (`Deps`) of every package version in a double map, first indexed by package, then by version.
```rust
// Use NumberVersion, which is a simple u32, for the versions.
use pubgrub::version::NumberVersion as Version;

/// Each package is identified by its name.
pub type PackageName = String;

/// Global registry of known packages.
pub struct Index {
    /// Specify dependencies of each package version.
    pub packages: Map<PackageName, BTreeMap<Version, Deps>>,
}
```
Dependencies listed in the `Index` include both mandatory and optional dependencies. Optional dependencies are identified, grouped, and gated by an option called a "feature".
```rust
pub type Feature = String;

pub struct Deps {
    /// The regular, mandatory dependencies.
    pub mandatory: Map<PackageName, Dep>,
    /// The optional, feature-gated dependencies.
    pub optional: Map<Feature, Map<PackageName, Dep>>,
}
```
Finally, each dependency is specified with a version range, and with a set of activated features.
```rust
pub struct Dep {
    /// The range depended upon.
    pub range: Range<Version>,
    /// The activated features for that dependency.
    pub features: Set<Feature>,
}
```
For convenience, we added the `add_deps` and `add_feature` functions to help build an index in the tests.
```rust
let mut index = Index::new();
// Add package "a" to the index at version 0 with no dependency.
index.add_deps("a", 0, &[]);
// Add package "a" at version 1 with a dependency to "b" at version "v >= 1".
index.add_deps("a", 1, &[("b", 1.., &[])]);
// Add package "c" at version 1 with a dependency to "d" at version "v < 4" with the feature "feat".
index.add_deps("c", 1, &[("d", ..4, &["feat"])]);
// Add feature "feat" to package "d" at version 1 with the optional dependency to "f" at version "v >= 1".
// If "d" at version 1 does not exist yet in the index, this also adds it with no mandatory dependency,
// as if we had called index.add_deps("d", 1, &[]).
index.add_feature("d", 1, "feat", &[("f", 1.., &[])]);
```
#### Implementing a dependency provider for the index
Now that our `Index` is ready, let's implement the `DependencyProvider` trait on it. As we explained before, we'll need to differentiate optional features from base packages, so we define a new `Package` type.
```rust
/// A package is either a base package like "a",
/// or a feature package, corresponding to a feature associated to a base package.
pub enum Package {
    Base(String),
    Feature { base: String, feature: String },
}
```
Let's implement the first function required by a dependency provider, `choose_package_version`. For that, we defined the `base_pkg()` method on a `Package`, returning the string of the base package, and the `available_versions()` method on an `Index`, listing existing versions of a given package. Then we simply called the `choose_package_with_fewest_versions` helper function provided by pubgrub.
```rust
fn choose_package_version<T: Borrow<Package>, U: Borrow<Range<Version>>>(
    &self,
    potential_packages: impl Iterator<Item = (T, U)>,
) -> Result<(T, Option<Version>), Box<dyn std::error::Error>> {
    Ok(pubgrub::solver::choose_package_with_fewest_versions(
        |p| self.available_versions(p.base_pkg()).cloned(),
        potential_packages,
    ))
}
```
This was very straightforward. Implementing the second function, `get_dependencies`, requires just a bit more work but is also quite easy. The exact complete version is available in the code but, again, for the sake of simplicity and readability, let's just deal with the happy path.
```rust
fn get_dependencies(
    &self,
    package: &Package,
    version: &Version,
) -> Result<Dependencies<Package, Version>, ...> {
    // Retrieve the dependencies in the index.
    let deps = self.packages
        .get(package.base_pkg()).unwrap()
        .get(version).unwrap();
    match package {
        // If we asked for a base package, we simply return the mandatory dependencies.
        Package::Base(_) => Ok(Dependencies::Known(from_deps(&deps.mandatory))),
        // Otherwise, we concatenate the feature deps with a dependency to the base package.
        Package::Feature { base, feature } => {
            let feature_deps = deps.optional.get(feature).unwrap();
            let mut all_deps = from_deps(feature_deps);
            all_deps.insert(
                Package::Base(base.to_string()),
                Range::exact(version.clone()),
            );
            Ok(Dependencies::Known(all_deps))
        },
    }
}
```
Quite easy, right? The helper function `from_deps` is where each dependency is transformed from its `String` form into its `Package` form. In pseudo-code (not compiling, but you get how it works) it looks as follows.
```rust
/// Helper function to convert Index deps into what is expected by the dependency provider.
fn from_deps(deps: &Map<String, Dep>) -> DependencyConstraints<Package, Version> {
    deps.iter()
        .map(create_feature_deps)
        .flatten()
        .collect()
}

fn create_feature_deps(
    (base_pkg, dep): (&String, &Dep),
) -> impl Iterator<Item = (Package, Range<Version>)> {
    if dep.features.is_empty() {
        // If the dependency has no optional feature activated,
        // we simply add a dependency to the base package.
        std::iter::once((Package::Base(base_pkg), dep.range))
    } else {
        // Otherwise, we instead add one dependency per activated feature,
        // but not to the base package itself (we could, but there is no need).
        dep.features.iter().map(|feat| {
            (Package::Feature { base: base_pkg, feature: feat }, dep.range)
        })
    }
}
```
We now have implemented the `DependencyProvider` trait. The only thing left is testing it with example dependencies. For that, we set up a few helper functions and wrote some tests, which you can run with a call to `cargo test --lib`. Below is one of those tests.
```rust
#[test]
fn success_when_multiple_features() {
    let mut index = Index::new();
    index.add_deps("a", 0, &[("b", .., &["feat1", "feat2"])]);
    index.add_feature("b", 0, "feat1", &[("f1", .., &[])]);
    index.add_feature("b", 0, "feat2", &[("f2", .., &[])]);
    index.add_deps::<R>("f1", 0, &[]);
    index.add_deps::<R>("f2", 0, &[]);
    assert_map_eq(
        &resolve(&index, "a", 0).unwrap(),
        &select(&[
            ("a", 0),
            ("b", 0),
            ("b/feat1", 0),
            ("b/feat2", 0),
            ("f1", 0),
            ("f2", 0),
        ]),
    );
}
```
## Allowing multiple versions of a package
One of the main goals of PubGrub is to pick one version per package depended on, under the assumption that at most one version per package is allowed. Enforcing this assumption has two advantages. First, it prevents interactions between the data and functions of the same library at different, potentially incompatible versions. Second, it reduces the code size and therefore also the disk usage and the compilation times.
However, under some circumstances, we may want to relax the "single version allowed" constraint. Imagine that our package "a" depends on "b" and "c", with "b" depending on "d" @ 1 and "c" depending on "d" @ 2, where versions 1 and 2 of "d" are incompatible with each other. With the default rules of PubGrub, that system has no solution and we cannot build "a". Yet, if our usages of "b" and "c" are independent with regard to "d", we could in theory build "b" with "d" @ 1 and build "c" with "d" @ 2.
Such situations are sufficiently common that most package managers allow multiple versions of a package. In Rust for example, multiple versions are allowed as long as they cross a major frontier, so 0.7 and 0.8, or 2.0 and 3.0. So the question now is: can we circumvent this fundamental restriction of PubGrub? The short answer is yes, with buckets and proxies.
### Buckets
We saw that implementing optional dependencies required the creation of on-demand feature packages, which are virtual packages created for each optional feature. In order to allow for multiple versions of the same package to coexist, we are also going to play smart with packages. Indeed, if we want two versions to coexist, there is only one possibility, which is that they come from different packages. And so we introduce the concept of buckets. A package bucket is a set of versions of the same package that cannot coexist in the solution of a dependency system, basically just like before. So for Rust crates, we can define one bucket per major frontier, such as one bucket for 0.1, one for 0.2, ..., one for 1.0, one for 2.0, etc. As such, versions 0.7.0 and 0.7.3 would be in the same bucket, but 0.7.0 and 0.8.0 would be in different buckets and could coexist.
Are buckets enough? Not quite, since once we introduce buckets, we also need a way to depend on multiple buckets alternatively. Indeed, most packages should have dependencies on a single bucket, because it doesn't make sense to depend on potentially incompatible versions at the same time. But rarely, dependencies are written such that they can depend on multiple buckets, such as when one writes `v >= 2.0`. Then, any version from the 2.0 bucket would satisfy it, as well as any version from the 3.0 or any other later bucket. Thus, we cannot write "a depends on b in bucket 2.0". So how do we write the dependencies of "a"? That's where we introduce the concept of proxies.
### Proxies
A proxy package is an intermediate on-demand package, placed just between one package and one of its dependencies. So if we need to express that package "a" has a dependency on package "b" for different buckets, we create the intermediate proxy package "a->b". Then we can say instead that package "a" depends on any version of the proxy package "a->b". And then, we create one proxy version for each bucket. So if there exists the following five buckets for "b", 0.1, 0.2, 1.0, 2.0, 3.0, we create five corresponding versions for the proxy package "a->b". And since our package "a" depends on any version of the proxy package "a->b", it will be satisfied as soon as any bucket of "b" has a version picked.
### Example
We will consider versions of the form `major.minor`, with the major component starting at 1 and the minor component starting at 0. The smallest version is 1.0, and each major component represents a bucket. Note that we could start versions at 0.0, but since different dependency systems tend to interpret versions before 1.0 differently, we will simply avoid that problem by saying versions start at 1.0.
For convenience, we will use a string notation for proxies and buckets. Buckets will be indicated by a "#", so "a#1" is the 1.0 bucket of package "a", and "a#2" is the 2.0 bucket of package "a". And we will use "@" to denote exact versions, so "a" @ 1.0 means package "a" at version 1.0. Proxies will be represented by the arrow "->" as previously mentioned. Since a proxy is tied to a specific version of the initial package, we also use the "@" symbol in the name of the proxy package. For example, if "a" @ 1.4 depends on "b", we create the proxy package "a#1@1.4->b". It's a bit of a mouthful, but once you learn how to read it, it makes sense.
Note that we might be tempted to just remove the version part of the proxy, so "a#1->b" instead of "a#1@1.4->b". However, we might have "a" @ 1.4 depending on "b" in range `v >= 2.2` and "a" @ 1.5 depending on "b" in range `v >= 2.6`. Both dependencies would map to the same "a#1->b" proxy package, but we would not know what to put in the dependency of "a#1->b" to the "b#2" bucket. Should it be `2.2 <= v < 3.0` as in "a" @ 1.4, or should it be `2.6 <= v < 3.0` as in "a" @ 1.5? That is why each proxy package is tied to an exact version of the source package.
Let's consider the following example, with "a" being our root package.
- "a" @ 1.4 depends on "b" in range
1.1 <= v < 2.9
- "b" @ 1.3 depends on "c" at version
1.1
- "b" @ 2.7 depends on "d" at version
3.1
Using buckets and proxies, we can rewrite this dependency system as follows.
- "a#1" @ 1.4 depends on "a#1@1.4->b" at any version (we use the proxy).
- "a#1@1.4->b" proxy exists in two versions, one per bucket of "b".
- "a#1@1.4->b" @ 1.0 depends on "b#1" in range
1.1 <= v < 2.0
(the first bucket intersected with the dependency range). - "a#1@1.4->b" @ 2.0 depends on "b#2" in range
2.0 <= v < 2.9
(the second bucket intersected with the dependency range). - "b#1" @ 1.3 depends on "c#1" at version
1.1
. - "b#2" @ 2.7 depends on "d#3" at version
3.1
.
There are potentially two solutions to that system. The one with the newest versions is the following.
- "a#1" @ 1.4
- "a#1@1.4->b" @ 2.0
- "b#2" @ 2.7
- "d#3" @ 3.1
Finally, if we want to express the solution in terms of the original packages, we just have to remove the proxy packages from the solution.
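Note that this rewritten system is solvable by pubgrub as-is, since buckets and proxies are just plain packages. Here is a sketch encoding it directly with string packages in an `OfflineDependencyProvider`:

```rust
use pubgrub::solver::{OfflineDependencyProvider, resolve};
use pubgrub::version::SemanticVersion;
use pubgrub::range::Range;

let mut dp = OfflineDependencyProvider::<&str, SemanticVersion>::new();
// "a#1" @ 1.4 depends on its proxy package at any version.
dp.add_dependencies("a#1", (1, 4, 0), [("a#1@1.4->b", Range::any())]);
// The proxy has one version per bucket of "b", each depending on
// that bucket intersected with the original range 1.1 <= v < 2.9.
dp.add_dependencies(
    "a#1@1.4->b",
    (1, 0, 0),
    [("b#1", Range::between((1, 1, 0), (2, 0, 0)))],
);
dp.add_dependencies(
    "a#1@1.4->b",
    (2, 0, 0),
    [("b#2", Range::between((2, 0, 0), (2, 9, 0)))],
);
dp.add_dependencies("b#1", (1, 3, 0), [("c#1", Range::exact((1, 1, 0)))]);
dp.add_dependencies("b#2", (2, 7, 0), [("d#3", Range::exact((3, 1, 0)))]);
dp.add_dependencies("c#1", (1, 1, 0), []);
dp.add_dependencies("d#3", (3, 1, 0), []);

// One of the two possible solutions will be picked by the solver.
let solution = resolve(&dp, "a#1", (1, 4, 0)).unwrap();
```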
### Example implementation
A complete example implementation of this extension allowing multiple versions is available in the `allow-multiple-versions` crate of the `advanced_dependency_providers` repository. In that example, packages are of type `String` and versions of type `SemanticVersion` defined in pubgrub, which does not account for pre-releases, just the (Major, Minor, Patch) format of versions.
#### Defining an index of packages
Inside the `index.rs` module, we define a very basic `Index`, holding all known packages, as well as a helper function `add_deps` easing the writing of tests.
```rust
/// Each package is identified by its name.
pub type PackageName = String;
/// Alias for dependencies.
pub type Deps = Map<PackageName, Range<SemVer>>;

/// Global registry of known packages.
pub struct Index {
    /// Specify dependencies of each package version.
    pub packages: Map<PackageName, BTreeMap<SemVer, Deps>>,
}

// Initialize an empty index.
let mut index = Index::new();
// Add package "a" to the index at version 1.0.0 with no dependency.
index.add_deps::<R>("a", (1, 0, 0), &[]);
// Add package "a" to the index at version 2.0.0 with a dependency to "b" at versions >= 1.0.0.
index.add_deps("a", (2, 0, 0), &[("b", (1, 0, 0)..)]);
```
#### Implementing a dependency provider for the index
Since our `Index` is ready, we now have to implement the `DependencyProvider` trait for it. As explained previously, we'll need to differentiate packages representing buckets and proxies, so we define the following new `Package` type.
```rust
/// A package is either a bucket, or a proxy between a source and a target package.
pub enum Package {
    /// "a#1"
    Bucket(Bucket),
    /// source -> target
    Proxy {
        source: (Bucket, SemVer),
        target: String,
    },
}

/// A bucket corresponds to a given package, and matches versions
/// in a range identified by their major component.
pub struct Bucket {
    pub name: String, // package name
    pub bucket: u32,  // 1 maps to the range 1.0.0 <= v < 2.0.0
}
```
In order to implement the first required method, `choose_package_version`, we simply reuse the `choose_package_with_fewest_versions` helper function provided by pubgrub. That one requires a list of available versions for each package, so we have to create that list. As explained previously, listing the existing (virtual) versions depends on whether the package is a bucket or a proxy. For a bucket package, we simply need to retrieve the original versions and filter out those outside the bucket.
```rust
match package {
    Package::Bucket(p) => {
        let bucket_range = Range::between((p.bucket, 0, 0), (p.bucket + 1, 0, 0));
        self.available_versions(&p.name)
            .filter(|v| bucket_range.contains(*v))
    }
    ...
}
```
If the package is a proxy however, there is one version per bucket in the target of the proxy.
```rust
match package {
    Package::Proxy { target, .. } => {
        bucket_versions(self.available_versions(&target))
    }
    ...
}

/// Take a list of versions, and output a list of the corresponding bucket versions.
/// So [1.1, 1.2, 2.3] -> [1.0, 2.0]
fn bucket_versions(
    versions: impl Iterator<Item = SemVer>,
) -> impl Iterator<Item = SemVer> { ... }
```
Additionally, we can filter out buckets that are outside of the dependency range of the original dependency leading to that proxy package. Otherwise, they would add wasteful computation for the solver, but we'll leave that out of this walkthrough.
The `get_dependencies` method is slightly hairier to implement, so instead of all the code, we will just show the structure of the function in the happy path, with its comments.
```rust
fn get_dependencies(
    &self,
    package: &Package,
    version: &SemVer,
) -> Result<Dependencies<Package, SemVer>, ...> {
    let all_versions = self.packages.get(package.pkg_name());
    ...
    match package {
        Package::Bucket(pkg) => {
            // If this is a bucket, we convert each original dependency into
            // either a dependency to a bucket package if the range is fully
            // contained within one bucket,
            // or a dependency to a proxy package at any version otherwise.
            let deps = all_versions.get(version);
            ...
            let pkg_deps = deps.iter()
                .map(|(name, range)| {
                    if let Some(bucket) = single_bucket_spanned(range) {
                        ...
                        (Package::Bucket(bucket_dep), range.clone())
                    } else {
                        ...
                        (proxy, Range::any())
                    }
                })
                .collect();
            Ok(Dependencies::Known(pkg_deps))
        }
        Package::Proxy { source, target } => {
            // If this is a proxy package, it depends on a single bucket package,
            // the target, at a range of versions corresponding to the bucket range
            // of the version asked, intersected with the original dependency range.
            let deps = all_versions.get(&source.1);
            ...
            let mut bucket_dep = Map::default();
            bucket_dep.insert(
                Package::Bucket(Bucket {
                    name: target.clone(),
                    bucket: target_bucket,
                }),
                bucket_range.intersection(target_range),
            );
            Ok(Dependencies::Known(bucket_dep))
        }
    }
}

/// If the range is fully contained within one bucket,
/// this returns that bucket identifier, otherwise, it returns None.
fn single_bucket_spanned(range: &Range<SemVer>) -> Option<u32> { ... }
```
That's all! The implementation also contains tests, with helper functions to build them. Here is the test corresponding to the example we presented above.
```rust
#[test]
/// Example in guide.
fn success_when_simple_version() {
    let mut index = Index::new();
    index.add_deps("a", (1, 4, 0), &[("b", (1, 1, 0)..(2, 9, 0))]);
    index.add_deps("b", (1, 3, 0), &[("c", (1, 1, 0)..(1, 1, 1))]);
    index.add_deps("b", (2, 7, 0), &[("d", (3, 1, 0)..(3, 1, 1))]);
    index.add_deps::<R>("c", (1, 1, 0), &[]);
    index.add_deps::<R>("d", (3, 1, 0), &[]);
    assert_map_eq(
        &resolve(&index, "a#1", (1, 4, 0)).unwrap(),
        &select(&[("a#1", (1, 4, 0)), ("b#2", (2, 7, 0)), ("d#3", (3, 1, 0))]),
    );
}
```
Implementing a dependency provider allowing both optional dependencies and multiple versions per package is left to the reader as an exercise (I've always wanted to say that).
## Public and Private packages
We just saw in the previous section that it is possible to have multiple versions of the same package, as long as we create independent virtual packages called buckets. Oftentimes, creating buckets with fixed version boundaries that are shared across the whole dependency system can cause version mismatches in dependency chains. This is for example the case with a chain of packages publicly re-exporting types of their dependencies and interacting with each other. Concretely, let's say that we depend on two packages "a" and "b", with "a" also depending on "b" but at a different version.
- "root" depends on "a" @ 1
- "root" depends on "b" @ 1
- "a" @ 1 depends on "b" @ 2
Without buckets, there is no solution since "b" is depended on at two different versions. If we introduce buckets with boundaries at major versions, we have a solution since "root" depends on "b#1" while "a#1" depends on bucket "b#2".
But, what if package "a" re-exports in its public interface a type coming from its "b" dependency? If we feed a function of "a" with a type from "b#1", compilation will fail, complaining about the argument being of the wrong type. Currently in the rust compiler, this creates cryptic error messages of the kind "Expected type T, found type T", where "T" is exactly the same thing. Of course, those compiler errors should be improved, but there is a way of preventing that situation entirely at the time of solving dependencies instead of at compilation time. We can call that the public/private scheme. It consists in marking dependencies with re-exported types as "public", while other dependencies are considered "private" since they don't leak their types. So instead of saying that "a" depends on "b", we say that "a" publicly depends on "b". Therefore public dependencies must be unique even across multiple major versions.
Note that there is one inconvenience though: we could have false positives, i.e. reject situations that the compiler would have accepted to compile. Indeed, it's not because "a" exposes public types of "b" that we are necessarily using them! Now let's detail the public/private scheme a bit more, and how it could be implemented with PubGrub.
### Public subgraph
Two versions of a package can only conflict with each other if they interact through a chain of publicly connected dependencies. This means that any private dependency can cut the chain of public dependencies. If "a" privately depends on "b", dependencies of "b" are guaranteed to be independent (usage-wise) of the rest of the dependency graph above package "a". This means that there is not one list of public packages, but rather multiple subgraphs of publicly connected packages, which start either at the root or at a private dependency frontier. And in each public subgraph, there can only be one version of each package.
### Seed markers
Since dependencies form a directed graph, each public subgraph can be uniquely identified by a root package, which we will call the "seed" of the public subgraph. This "seed" is in fact the source package of a private dependency link, and all publicly connected packages following the target package of the private dependency link can be marked with that seed. In addition, all packages behind a private link can only be accessed by the source package of that private dependency, so all seed markers existing prior to the private link can be cleared, leaving only the seed marker of the source package. The diagram below provides a visual example of dependency graphs where seed markers are annotated next to each package.

In fact, as soon as a package has at least one private dependency, all dependency branches must be marked with the seed marker of the source package. This is because all branches contain code that is used by the source package. As a result, if a package has both a private dependency and a public dependency, the public dependency will inherit all markers of the source package, plus a new seed marker for the source package itself. Therefore, the number of seed markers along a public dependency chain grows with the number of branches that also contain private dependencies, as visible in the diagram below.
### Example
Let's consider the previous branching example, where "b" is depended on both by our root package and by our dependency "a". If we note seed markers with a dollar symbol "$", that example can be adapted to the following system.
- "root" depends on "a$root" @ 1
- "root" depends on "b$root" @ 1
- "a$root" @ 1 depends privately on "b$a@1" @ 2
Seed markers must correspond to an exact package version because multiple versions of a given package will have different dependency graphs, and we don't want to wrongly assume that all subgraphs are the same for all versions. Here, since "a" depends privately on "b", "b" is marked with the seed "$a@1". Thus, this system has the following solution.
- "a$root" @ 1
- "b$root" @ 1
- "b$a@1" @ 2
If instead of a private dependency, "a" had a public dependency on "b", there would be no new seed marker and it would read:
- "a$root" @ 1 depends publicly on "b$root" @ 2
Leading to no solution, since the package "b$root" is now required both at version 1 and 2.
### Example implementation
The seed markers scheme presented above can easily be implemented with pubgrub by keeping seed markers together with package names in the `Package` type involved in the `DependencyProvider`. A complete example implementation of this extension allowing public and private dependencies is available in the `public-private` crate of the `advanced_dependency_providers` repository. In that example, packages are of type `String` and versions of type `SemanticVersion` defined in pubgrub, which does not account for pre-releases, just the (Major, Minor, Patch) format of versions.
#### Defining an index of packages
Inside the `index.rs` module, we define a very basic `Index`, holding all known packages, as well as a helper function `add_deps` easing the writing of tests.
```rust
/// Each package is identified by its name.
pub type PackageName = String;
/// Alias for dependencies.
pub type Deps = Map<PackageName, (Privacy, Range<SemVer>)>;
/// Privacy indicates if a dependency is public or private.
pub enum Privacy { Public, Private }

/// Global registry of known packages.
pub struct Index {
    /// Specify dependencies of each package version.
    pub packages: Map<PackageName, BTreeMap<SemVer, Deps>>,
}

// Initialize an empty index.
let mut index = Index::new();
// Add package "a" to the index at version 1.0.0 with no dependency.
index.add_deps::<R>("a", (1, 0, 0), &[]);
// Add package "a" to the index at version 2.0.0 with a private dependency to "b" at versions >= 1.0.0.
index.add_deps("a", (2, 0, 0), &[("b", Private, (1, 0, 0)..)]);
// Add package "a" to the index at version 3.0.0 with a public dependency to "b" at versions >= 1.0.0.
index.add_deps("a", (3, 0, 0), &[("b", Public, (1, 0, 0)..)]);
```
#### Implementing a dependency provider for the index
Since our `Index` is ready, we now have to implement the `DependencyProvider` trait for it. As explained previously, we need to identify which public subgraphs a given dependency belongs to. That is why each package also holds seed markers, which are the identifiers of the "seed" packages at the origin of each public subgraph this package belongs to. Since we need a unique hash per package and per seed, and there can be multiple seed markers, the `PkgSeeds` type is actually an enum where a `Markers` variant has exactly one dependency to a `Constraint` variant per seed listed in its markers, in addition to the dependencies registered in the index. And as its name suggests, the `Constraint` variant of a `PkgSeeds` package is only there to make sure that the "1-version-per-seed" constraint is satisfied; it does not have any dependency. Thanks to that, we guarantee that there can only be one version of each package per public subgraph.
```rust
/// A package is identified by its name and by the public subgraphs
/// it belongs to, themselves identified by "seeds".
pub struct Package {
    name: String,
    seeds: PkgSeeds,
}

/// Each public subgraph is identified by a seed,
/// and some packages belong to multiple public subgraphs
/// and can thus have multiple seed markers.
/// Since we also need to have a unique hash per package, per public subgraph,
/// each `Markers` variant of a package will also have a dependency on
/// one `Constraint` variant per seed, resulting in one unique identifier
/// per public subgraph that PubGrub can use to check constraints on versions.
///
/// Remark: `Markers.pkgs` is just an implementation detail to prevent cycles
/// with seeds of different versions and same package,
/// which is impossible since we cannot pick two different versions
/// of the same package in the same public subgraph.
pub enum PkgSeeds {
    Constraint(Seed),
    Markers {
        seed_markers: Set<Seed>,
        pkgs: Set<String>,
    },
}

/// A seed is the identifier associated with the private package
/// at the origin of a public subgraph.
pub struct Seed {
    /// Seed package identifier
    pub name: String,
    /// Seed version identifier
    pub version: SemVer,
}
```
Implementing `choose_package_version` is trivial if we simply use the `choose_package_with_fewest_versions` function provided by pubgrub. Implementing `get_dependencies` is slightly more complicated. Have a look at the complete implementation if needed; the main ideas are the following.
```rust
fn get_dependencies(
    &self,
    package: &Package,
    version: &SemVer,
) -> Result<Dependencies<Package, SemVer>, ...> {
    match &package.seeds {
        // A Constraint variant does not have any dependency.
        PkgSeeds::Constraint(_) => Ok(Dependencies::Known(Map::default())),
        // A Markers variant has dependencies to:
        // - one Constraint variant per seed marker
        // - one Markers variant per original dependency
        PkgSeeds::Markers { seed_markers, pkgs } => {
            // Seed constraints, one per seed for this package@version.
            let seed_constraints = ...;
            // Figure out if there are private dependencies.
            let has_private = ...;
            Ok(Dependencies::Known(
                // Chain the seed constraints with the actual dependencies.
                seed_constraints
                    .chain(index_deps.iter().map(|(p, (privacy, r))| {
                        let seeds = if privacy == &Privacy::Private {
                            // Make a singleton seed package.
                        } else if has_private {
                            // This is public, but there is also a private dep,
                            // so add the source package to the seed markers.
                        } else {
                            // Simply reuse the parent seed markers.
                        };
                        (Package { name: p, seeds }, r)
                    }))
                    .collect(),
            ))
        }
    }
}
```
## Versions in a continuous space
The current design of pubgrub exposes a `Version` trait demanding two properties: (1) that there exists a lowest version, and (2) that versions live in a discrete space where the successor of each version is known. So versions are basically isomorphic to N^n, where N is the set of natural numbers.
The successor design
There is a good reason why we started with the successor design for the Version
trait.
When building knowledge about the dependency system, pubgrub needs to compare sets of versions, and to perform common set operations such as intersection, union, inclusion, comparison for equality etc.
In particular, given two sets of versions S1 and S2, it needs to be able to answer if S1 is a subset of S2 (S1 ⊂ S2).
And we know that S1 ⊂ S2 if and only if S1 ∩ S2 == S1.
So checking for subsets can be done by checking for the equality between two sets of versions.
Therefore, sets of versions need to have unique canonical representations to be comparable.
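For example, taking \(S_1 = [0, 4[\) and \(S_2 = [0, 9[\), we indeed have \(S_1 \cap S_2 = [0, 4[ \, = S_1\), confirming that \(S_1 \subset S_2\).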
We have the interesting property that we require Version
to have a total order.
As a consequence, the most adequate way to represent sets of versions with a total order is to use a sequence of non-intersecting segments, such as [0, 3] ∪ ]5, 9[ ∪ [42, +∞[.
Notation: we denote segments with closed or open brackets depending on whether the value at the frontier is included in or excluded from the interval. It is also common to use a parenthesis for open brackets. So [0, 14[ is equivalent to [0, 14) in that other notation.
The previous set is for example composed of three segments,
- the closed segment [0, 3] containing versions 0, 1, 2 and 3,
- the open segment ]5, 9[ containing versions 6, 7 and 8,
- the semi-open segment [42, +∞[ containing all numbers above 42.
For the initial design, we did not want to have to deal with closed or open brackets on both interval bounds.
Since we have a lowest version, the left bracket of segments must be closed to be able to contain that lowest version.
And since Version
does not impose any upper bound, we need to use open brackets on the right side of segments.
Our previous set thus becomes: [0, ?[ ∪ [?, 9[ ∪ [42, +∞[. The question now is what to use in place of the 3 in the first segment and of the 5 in the second segment.
This is the reason why we require the bump()
method on the Version
trait.
If we know the next version, we can replace 3 by bump(3) == 4, and 5 by bump(5) == 6.
We finally get the following representation: [0, 4[ ∪ [6, 9[ ∪ [42, +∞[.
And so the Range
type is defined as follows.
#![allow(unused)] fn main() {
pub struct Range<V: Version> {
    segments: Vec<Interval<V>>,
}
type Interval<V> = (V, Option<V>);

// set = [0, 4[ ∪ [6, 9[ ∪ [42, +∞[
let set = vec![(0, Some(4)), (6, Some(9)), (42, None)];
}
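As an illustration, that same set can also be built with the constructors provided by the Range type, here with NumberVersion (a small usage sketch relying on the between, higher_than, union and contains functions of the current API):
#![allow(unused)] fn main() {
use pubgrub::range::Range;
use pubgrub::version::NumberVersion;

// set = [0, 4[ ∪ [6, 9[ ∪ [42, +∞[
let set = Range::between(NumberVersion(0), NumberVersion(4))
    .union(&Range::between(NumberVersion(6), NumberVersion(9)))
    .union(&Range::higher_than(NumberVersion(42)));

assert!(set.contains(&NumberVersion(3)));
assert!(!set.contains(&NumberVersion(5)));
assert!(set.contains(&NumberVersion(100)));
}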
The bounded interval design
We may however want versions to live in a continuous space. For example, if we want to use fractions, we can always build a new fraction between two others, so it is impossible to define the successor of a fraction version.
We are currently investigating the use of bounded intervals to enable continuous spaces for versions. If it happens, this will only be in the next major release of pubgrub, probably 0.3. The current experiments look as follows.
#![allow(unused)] fn main() {
/// New trait for versions.
/// Bound is core::ops::Bound.
pub trait Version: Clone + Ord + Debug + Display {
    /// Returns the minimum version.
    fn minimum() -> Bound<Self>;
    /// Returns the maximum version.
    fn maximum() -> Bound<Self>;
}

/// An interval is a bounded domain containing all values
/// between its starting and ending bounds.
/// RangeBounds is core::ops::RangeBounds.
pub trait Interval<V>: RangeBounds<V> + Debug + Clone + Eq + PartialEq {
    /// Create an interval from its starting and ending bounds.
    /// It's the caller's responsibility to order them correctly.
    fn new(start_bound: Bound<V>, end_bound: Bound<V>) -> Self;
}

/// The new Range type is composed of bounded intervals.
pub struct Range<I> {
    segments: Vec<I>,
}
}
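To make the bounds manipulation concrete, here is a minimal sketch of a hypothetical BoundedInterval type (not part of the crate) that stores its two bounds and could implement the experimental Interval trait above:
#![allow(unused)] fn main() {
use core::ops::{Bound, RangeBounds};

/// Hypothetical interval type storing its two bounds.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BoundedInterval<V> {
    start: Bound<V>,
    end: Bound<V>,
}

impl<V> RangeBounds<V> for BoundedInterval<V> {
    fn start_bound(&self) -> Bound<&V> {
        bound_as_ref(&self.start)
    }
    fn end_bound(&self) -> Bound<&V> {
        bound_as_ref(&self.end)
    }
}

/// Helper converting &Bound<V> into Bound<&V>.
fn bound_as_ref<V>(bound: &Bound<V>) -> Bound<&V> {
    match bound {
        Bound::Included(v) => Bound::Included(v),
        Bound::Excluded(v) => Bound::Excluded(v),
        Bound::Unbounded => Bound::Unbounded,
    }
}

// The mathematical interval ]5, 9] is then
// BoundedInterval { start: Bound::Excluded(5), end: Bound::Included(9) },
// and [42, +∞[ is
// BoundedInterval { start: Bound::Included(42), end: Bound::Unbounded }.
}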
It is certain though that the flexibility of enabling continuous spaces will come at a performance price. We just have to evaluate how much it costs and whether it is worth sharing a single implementation, or having both a discrete and a continuous implementation.
Pre-release versions
Pre-releasing is a very common pattern in the world of versioning. It is however one of the worst things to handle in a dependency system, and I highly recommend avoiding the introduction of pre-releases in your package manager if you can. In the context of pubgrub, pre-releases break two fundamental properties of the solver.
- Pre-releases act similarly to continuous spaces.
- Pre-releases break the mathematical properties of subsets in a space with total order.
(1) Indeed, it is hard to answer what version comes after "1-alpha0". Is it "1-alpha1", "1-beta0", "2"? In practice, we could say that the version that comes after "1-alpha0" is "1-alpha0?" where the "?" character is chosen to be the lowest character in the lexicographic order, but that is clearly a stretch and it certainly isn't natural.
(2) Pre-releases are often semantically linked to version constraints written by humans, interpreted differently depending on context.
For example, "2.0.0-beta" is meant to exist previous to version "2.0.0".
Yet, it is not supposed to be contained in the set described by 1.0.0 <= v < 2.0.0, and only in sets where one of the bounds contains a pre-release marker, such as 2.0.0-alpha <= v < 2.0.0.
This poses a problem to the dependency solver because of backtracking.
Indeed, the PubGrub algorithm relies on knowledge accumulated all along the propagation of the solver front.
And this knowledge is composed of facts, which are thus never removed even when backtracking happens.
Those facts are called incompatibilities and more info about those is available in the "Internals" section of the guide.
The problem is that if a fact recalls that there is no version within the 1.0.0 <= v < 2.0.0 range, backtracking to a situation where we ask for a version within 2.0.0-alpha <= v < 2.0.0 will return nothing, without even checking whether a pre-release exists in that range.
And this is one of the fundamental mechanisms of the algorithm, so we should not try to alter it.
Point (2) is probably the reason why some pubgrub implementations have issues dealing with pre-releases when backtracking, as can be seen in an issue of the dart implementation.
Playing again with packages?
In the light of the "bucket" and "proxies" scheme we introduced in the section about allowing multiple versions per package, I'm wondering if we could do something similar for pre-releases.
Normal versions and pre-release versions would be split into two subsets, each attached to a different bucket.
In order to make this work, we would need a way to express negative dependencies.
For example, we would want to say: "a" depends on "b" within the (2.0, 3.0) range and is incompatible with any pre-release version of "b".
The tool to express such dependencies is already available in the form of Term
which can be a Positive
range or a Negative
one.
We would have to adjust the API for the get_dependencies
method to return terms instead of ranges.
This may have consequences on other parts of the algorithm and should be thoroughly tested.
One issue is that the proxy and bucket scheme would allow for having both a normal and a pre-release version of the same package in dependencies. We do not want that, so instead of proxy packages, we might have "frontend" packages. The difference being that a proxy links a source to a target, while a frontend does not care about the source, only the target. As such, only one frontend version can be selected, thus followed by either a normal version or a pre-release version but not both.
Another issue would be that the proxy and bucket scheme breaks strategies depending on the ordering of versions. Since we have two proxy versions, one targeting the normal bucket and one targeting the pre-release bucket, a strategy aiming at the newest versions will lean towards normal or pre-release versions depending on whether the newest proxy version is the one for the normal or the pre-release bucket. Mitigating this issue seems complicated, but hopefully, we are also exploring alternative API changes that could enable pre-releases.
Multi-dimensional ranges
We are currently exploring new APIs where Range
is transformed into a trait, instead of a predefined struct with a single sequence of non-intersecting intervals.
For now, the new trait is called RangeSet
and could be implemented on structs with multiple dimensions for ranges.
#![allow(unused)] fn main() {
pub struct DoubleRange<V1: Version, V2: Version> {
    normal_range: Range<V1>,
    prerelease_range: Range<V2>,
}
}
With multi-dimensional ranges we could match the semantics of version constraints in ways that do not introduce alterations of the core of the algorithm.
For example, the constraint 2.0.0-alpha <= v < 2.0.0
could be matched to:
#![allow(unused)] fn main() {
DoubleRange {
    normal_range: Range::none,
    prerelease_range: Range::between("2.0.0-alpha", "2.0.0"),
}
}
And the constraint 2.0.0-alpha <= v < 2.1.0
would have the same prerelease_range
but would have 2.0.0 <= v < 2.1.0
for the normal range.
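Concretely, with the same hypothetical API as in the previous block:
#![allow(unused)] fn main() {
DoubleRange {
    normal_range: Range::between("2.0.0", "2.1.0"),
    prerelease_range: Range::between("2.0.0-alpha", "2.0.0"),
}
}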
Those constraints could also be interpreted differently, since not all pre-release systems work the same.
But the important property is that this enables a separation of the dimensions that do not behave consistently with regard to the mathematical properties of the sets manipulated.
All this is under ongoing experimentation, trying to reach a sweet spot API-wise and performance-wise. If you are eager to experiment with all the extensions and limitations mentioned in this section of the guide for your dependency provider, don't hesitate to reach out to us in our zulip stream or in GitHub issues to let us know how it went!
Internals of the PubGrub algorithm
For an alternative / complementary explanation of the PubGrub algorithm, you can read the detailed description of the solver provided by the original PubGrub author in the GitHub repository of the dart package manager pub.
PubGrub is an algorithm inspired by conflict-driven nogood learning (CDNL-ASP), an approach presented by Gebser, Kaufmann and Schaub in 2012. The approach consists in iteratively taking decisions (here picking a package and version) until reaching a conflict. At that point it records a nogood (an "incompatibility" in PubGrub terminology) describing the root cause of the conflict and backtracks to a state previous to the decision leading to that conflict. CDNL has many similarities with CDCL (conflict-driven clause learning) with the difference that nogoods are conjunctions while clauses are disjunctions of literals. More documentation of their approach is available on their website.
At any moment, the PubGrub algorithm holds a state composed of two things, (1) a partial solution and (2) a set of incompatibilities. The partial solution (1) is a chronological list of "assignments", which are either decisions taken or version constraints for packages where no decision was made yet. The set of incompatibilities (2) is an ever-growing collection of incompatibilities. We will describe them in more detail later but simply put, an incompatibility describes packages that are dependent or incompatible, that is packages that must be present at the same time or that cannot be present at the same time in the solution.
Incompatibilities express facts, and as such are always valid. Therefore, the set of incompatibilities is never backtracked, only growing and recording new knowledge along the way. In contrast, the partial solution contains decisions and deductions (called "derivations" in PubGrub terminology), that are dependent on every decision made. Therefore, PubGrub needs to be able to backtrack the partial solution to an older state when there is a conflict.
Overview of the algorithm
Solver main loop
The solver runs in a loop with the following steps:
- Perform unit propagation on the currently selected package.
- Make a decision: pick a new package and version compatible with the current state of the partial solution.
- Retrieve dependencies for the newly selected package and transform those into incompatibilities.
At any point within the loop, the algorithm may fail, either because a conflict cannot be solved or because an error occurs while trying to retrieve dependencies. When there are no more decisions to be made, the algorithm returns the decisions from the partial solution.
Unit propagation overview
Unit propagation is the core mechanism of the algorithm. For the currently selected package, unit propagation aims at deriving new constraints (called "terms" in PubGrub and "literals" in CDNL terminology), from all incompatibilities referring to that package. For example, if an incompatibility specifies that packages a and b are incompatible, and if we just made a decision for package a, then we can derive a term specifying that package b should not appear in the solution.
While browsing incompatibilities, we may stumble upon one that is already "satisfied" by the current state of the partial solution. In our previous example, that would be the case if we had already made a decision for package b (in practice that exact situation could not happen, but let's leave that subtlety for later). If an incompatibility is satisfied, we call that a conflict and must perform conflict resolution to backtrack the partial solution to a state previous to that conflict. Details on conflict resolution are presented in its dedicated section.
Terms
Dependency solving is a complex task where intuition may not be correct or scale to complex situations.
We therefore anchor this task in sound mathematics and logic primitives.
In turn, this demands precise notation and translations between
the primitives of an application layer (Cargo.toml
, requirements.txt
, ...)
and those of the underlying mathematics (CDCL, CDNL, ...).
We introduce some of that notation here.
Definition
For any given package, we describe a set of possible versions as a range. For example, a range may correspond to a single version, the set of versions higher than 2 and lower than 5, the set containing every possible version or even the set containing no version at all. And once we have ranges, we can build terms, defined as either positive or negative ranges.
#![allow(unused)] fn main() {
pub enum Term<V> {
    Positive(Range<V>),
    Negative(Range<V>),
}
}
A term is a literal evaluated either true or false depending on the evaluation context. A positive term is evaluated true if and only if a version contained in its range was selected. A negative term is evaluated true if and only if a version not contained in its range was selected, or no version was selected.
Remark on negative terms:
A negative term is different from the positive term built on the complement range. Indeed, saying "a version was picked in the complement of this range" is different from saying "either no version was picked, or a version was picked in the complement of this range".
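To make these semantics concrete, here is a sketch of a hypothetical evaluation function (not part of the crate API), reusing the Term definition above with the provided Range type:
#![allow(unused)] fn main() {
use pubgrub::range::Range;
use pubgrub::version::Version;

pub enum Term<V: Version> {
    Positive(Range<V>),
    Negative(Range<V>),
}

/// Hypothetical helper: evaluate a term given the version
/// selected for its package (None if no version was selected).
fn eval<V: Version>(term: &Term<V>, selected: Option<&V>) -> bool {
    match (term, selected) {
        // Positive: true iff a version inside the range was selected.
        (Term::Positive(r), Some(v)) => r.contains(v),
        (Term::Positive(_), None) => false,
        // Negative: true iff a version outside the range was selected,
        // or no version was selected at all.
        (Term::Negative(r), Some(v)) => !r.contains(v),
        (Term::Negative(_), None) => true,
    }
}
}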
In this guide, for any given range \(r\), we will note \([r]\) its associated positive term, and \(\neg [r]\) its associated negative term. And for any term \(T\), whether positive or negative range, we will note \(\overline{T}\) the negation of that term, transforming a positive range into a negative one and vice versa. Therefore we have the following notation rules,
\[\begin{eqnarray} \overline{[r]} &=& \neg [r], \nonumber \\ \overline{\neg [r]} &=& [r]. \nonumber \\ \end{eqnarray}\]
Remark on notation:
To keep the notation precise, we only use the \(\neg\) symbol in front of ranges, such as \(\neg[r]\). So we know when seeing it that we are talking about a negative term. In contrast, we use the overline symbol generically, so we can't make the assumption that the presence, or absence, of overline describes a positive or negative term. So for a generic term \(T\), we will not write \(\neg T\), but \(\overline{T}\), and for a range \(r\), we will not write \(\neg \neg [r]\), but \(\overline{\neg [r]}\) which is also directly \([r]\).
Special terms
Provided that we have defined an empty range of versions \(\varnothing\), there are two terms with special behavior which are \([\varnothing]\) and \(\neg [\varnothing]\). By definition, \([\varnothing]\) is evaluated true if and only if a version contained in the empty range is selected. This is impossible and as such the term \([\varnothing]\) is always evaluated false. And by negation, the term \(\neg [\varnothing]\) is always evaluated true.
Intersection of terms
We define the "intersection" of two terms as the conjunction of those two terms (a logical AND). Therefore, if any of the terms is positive, the intersection also is a positive term. Given any two ranges of versions \(r_1\) and \(r_2\), the intersection of terms based on those ranges is defined as follows,
\[\begin{eqnarray} [r_1] \land [r_2] &=& [r_1 \cap r_2], \nonumber \\ [r_1] \land \neg [r_2] &=& [r_1 \cap r_2^C], \nonumber \\ \neg [r_1] \land \neg [r_2] &=& \neg [r_1 \cup r_2]. \nonumber \\ \end{eqnarray}\]
Where \(\cap\) and \(\cup\) are the notations for set intersections and unions of version ranges, and \(r_2^C\) is the complement set of the one defined by the range \(r_2\).
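For example, with number versions, taking \(r_1 = [2, +\infty[\) and \(r_2 = [5, +\infty[\), we have \(r_2^C = [0, 5[\), so \([r_1] \land \neg [r_2]\) is the positive term associated with the range \([2, +\infty[ \cap [0, 5[ = [2, 5[\).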
For any two terms \(T_1\) and \(T_2\), their union and intersection are related by De Morgan's rule:
\[ \overline{T_1 \lor T_2} = \overline{T_1} \land \overline{T_2}. \]
Relation between terms
We say that a term \(T_1\) is satisfied by another term \(T_2\) if \(T_2\) implies \(T_1\), i.e. when \(T_2\) is evaluated true then \(T_1\) must also be true. This is equivalent to saying that \(T_2\) is a subset of \(T_1\), which is verified if \( T_2 \land T_1 = T_2 \).
Note on comparability of terms:
Checking if a term is satisfied by another term is accomplished in the code by verifying if the intersection of the two terms equals the second term. It is thus very important that terms have unique representations, and consequently that ranges have a unique representation too.
In the provided Range type, ranges are implemented as an ordered vec of non-intersecting semi-open intervals. It is thus important that they are always reduced to their canonical form. For example, the range 2 <= v < 2 is actually empty and thus should not be represented by vec![(2, Some(2))] but by the empty vec![]. This requires special care when implementing range intersection.
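As an illustration, here is a sketch of those canonicality invariants over integer versions (illustrative code, not the crate's actual checks):
#![allow(unused)] fn main() {
type Interval = (u32, Option<u32>);

/// Check that a list of segments is in canonical form:
/// segments are non-empty, sorted, and separated by at least one version.
fn is_canonical(segments: &[Interval]) -> bool {
    let mut previous_end = None;
    for (i, (start, end)) in segments.iter().enumerate() {
        // Only the last segment may be unbounded on the right.
        if end.is_none() && i != segments.len() - 1 {
            return false;
        }
        // A semi-open segment [start, end[ must be non-empty.
        if let Some(e) = end {
            if e <= start {
                return false;
            }
        }
        // Segments must be sorted and non-contiguous:
        // [0, 4[ ∪ [4, 9[ must be merged into [0, 9[.
        if let Some(p) = previous_end {
            if *start <= p {
                return false;
            }
        }
        previous_end = *end;
    }
    true
}

// The empty range is the empty vec, not vec![(2, Some(2))].
assert!(is_canonical(&[]));
assert!(!is_canonical(&[(2, Some(2))]));
}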
Incompatibilities
Definition
Incompatibilities are called "nogoods" in CDNL-ASP terminology. An incompatibility is a conjunction of package terms that must be evaluated false, meaning at least one package term must be evaluated false. Otherwise we say that the incompatibility has been "satisfied". Satisfied incompatibilities represent conflicts and thus the goal of the PubGrub algorithm is to build a solution such that none of the produced incompatibilities are ever satisfied. If one incompatibility becomes satisfied at some point, the algorithm finds the root cause of it and backtracks the partial solution to a state prior to the decision at the origin of that root cause.
Remark: incompatibilities (nogoods) are the opposite of clauses in traditional conflict-driven clause learning (CDCL) which are disjunctions of literals that must be evaluated true, so have at least one literal evaluated true.
The gist of CDCL is that it builds a solution to satisfy a conjunctive normal form (conjunction of clauses) while CDNL builds a solution to unsatisfy a disjunctive normal form (disjunction of nogoods).
In addition, PubGrub is a lazy CDNL algorithm since the disjunction of nogoods (incompatibilities) is built on the fly with the solution.
In this guide, we will note incompatibilities with curly braces. An incompatibility containing one term \(T_a\) for package \(a\) and one term \(T_b\) for package \(b\) will be noted
\[ \{ a: T_a, b: T_b \}. \]
Remark: in a more "mathematical" setting, we would probably have noted \( T_a \land T_b \), but the chosen notation maps well with the representation of incompatibilities as hash maps.
Properties
Packages only appear once in an incompatibility. Since an incompatibility is a conjunction, multiple terms for the same package are merged with the intersection of those terms.
Terms that are always satisfied can be removed from an incompatibility. We previously explained that the term \( \neg [\varnothing] \) is always evaluated true. As a consequence, it can safely be removed from the conjunction of terms that is the incompatibility.
\[ \{ a: T_a, b: T_b, c: \neg [\varnothing] \} = \{ a: T_a, b: T_b \} \]
Dependencies can be expressed as incompatibilities. Saying that versions in range \( r_a \) of package \( a \) depend on versions in range \( r_b \) of package \( b \) can be expressed by the incompatibility
\[ \{ a: [r_a], b: \neg [r_b] \}. \]
Unit propagation
If all terms but one of an incompatibility are satisfied by a partial solution, we can deduce that the remaining unsatisfied term must be evaluated false. We can thus derive a new unit term for the partial solution which is the negation of the remaining unsatisfied term of the incompatibility. For example, if we have the incompatibility \( \{ a: T_a, b: T_b, c: T_c \} \) and if \( T_a \) and \( T_b \) are satisfied by terms in the partial solution then we can derive that the term \( \overline{T_c} \) can be added for package \( c \) in the partial solution.
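For example, combining this with the dependency incompatibility \( \{ a: [r_a], b: \neg [r_b] \} \) seen above: if the partial solution contains a decision for package \(a\) with a version inside \(r_a\), the term \([r_a]\) is satisfied, so the remaining term \(\neg [r_b]\) must be evaluated false, and we can add the derivation \([r_b]\) for package \(b\), meaning a version of \(b\) inside \(r_b\) must eventually be picked.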
Rule of resolution
Intuitively, we are able to deduce things such as if package \( a \) depends on package \( b \) which depends on package \( c \), then \( a \) depends on \( c \). With incompatibilities, we would note
\[ \{ a: T_a, b: \overline{T_b} \}, \quad \{ b: T_b, c: \overline{T_c} \} \quad \Rightarrow \quad \{ a: T_a, c: \overline{T_c} \}. \]
This is the simplified version of the rule of resolution. For the generalization, let's reuse the "more mathematical" notation of conjunctions for incompatibilities \( T_a \land T_b \) and the above rule would be
\[ T_a \land \overline{T_b}, \quad T_b \land \overline{T_c} \quad \Rightarrow \quad T_a \land \overline{T_c}. \]
In fact, the above rule can also be expressed as follows
\[ T_a \land \overline{T_b}, \quad T_b \land \overline{T_c} \quad \Rightarrow \quad T_a \land (\overline{T_b} \lor T_b) \land \overline{T_c} \]
since for any term \( T \), the disjunction \( T \lor \overline{T} \) is always true. In general, for any two incompatibilities \( T_a^1 \land T_b^1 \land \cdots \land T_z^1 \) and \( T_a^2 \land T_b^2 \land \cdots \land T_z^2 \) we can deduce a third, called the resolvent whose expression is
\[ (T_a^1 \lor T_a^2) \land (T_b^1 \land T_b^2) \land \cdots \land (T_z^1 \land T_z^2). \]
In that expression, only one pair of package terms is regrouped as a union (a disjunction), the others are all intersected (conjunction). If a term for a package does not exist in one incompatibility, it can safely be replaced by the term \( \neg [\varnothing] \) in the expression above as we have already explained before.
Partial solution
The partial solution is the part of PubGrub state holding all the decisions taken and the derivations computed from unit propagation on almost satisfied incompatibilities. We regroup decisions and derivations under the term "assignment".
The partial solution must be backtrackable when conflicts are detected. For this reason all assignments are recorded with their associated decision level. The decision level of an assignment is a counter for the number of decisions that have already been taken (including that one if it is a decision). If we represent all assignments as a chronological vec, they would look as follows:
[ (0, root_derivation),
(1, root_decision),
(1, derivation_1a),
(1, derivation_1b),
...,
(2, decision_2),
(2, derivation_2a),
...,
]
The partial solution must also enable efficient evaluation of incompatibilities in the unit propagation loop. For this, we need to have efficient access to all assignments referring to the packages present in an incompatibility.
To enable both efficient backtracking and efficient access to specific
package assignments, the current implementation holds a dual representation
of the partial solution.
One is called history
and keeps dated (with decision levels) assignments
in an ordered growing vec.
The other is called memory
and organizes assignments in a hashmap
where they are regrouped by packages which are the hashmap keys.
It would be interesting to see how the partial solution is stored
in other implementations of PubGrub such as the one in dart pub.
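A simplified sketch of that dual representation could look as follows (type and field names are illustrative, not the exact implementation):
#![allow(unused)] fn main() {
use std::collections::HashMap;

type DecisionLevel = u32;

#[derive(Clone)]
enum Assignment {
    Decision { package: String, version: u32 },
    Derivation { package: String /* term, cause, ... */ },
}

struct PartialSolution {
    /// Chronological list of assignments, dated with decision levels,
    /// used for backtracking.
    history: Vec<(DecisionLevel, Assignment)>,
    /// Assignments regrouped per package,
    /// used for efficient evaluation of incompatibilities.
    memory: HashMap<String, Vec<Assignment>>,
}
}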
Conflict resolution
As stated before, a conflict is a satisfied incompatibility that we detected in the unit propagation loop. The goal of conflict resolution is to backtrack the partial solution such that we have the following guarantees:
- The root cause incompatibility of the conflict is almost satisfied (such that we can continue unit propagation).
- The following derivations will be different from before conflict resolution.
Let the "satisfier" be the earliest assignment in the partial solution making the incompatibility fully satisfied by the partial solution up to that point. We know that we want to backtrack the partial solution at least previous to that assignment. Backtracking only makes sense if done at decision levels frontiers. As such the conflictual incompatibility can only become "almost satisfied" if there is only one package term related to incompatibility satisfaction at the decision level of that satisfier. When the satisfier is a decision this is trivial since all previous assignments are of lower decision levels. When the satisfier is a derivation however we need to check that property. We do that by computing the "previous satisfier" decision level. The previous satisfier is (if it exists) the earliest assignment previous to the satisfier such that the partial solution up to that point, plus the satisfier, makes the incompatibility satisfied. Once we found it, we know that property (1) is guaranteed as long as we backtrack to a decision level between the one of the previous satisfier and the one of the satisfier, as long as these are different.
If the satisfier and previous satisfier decision levels are the same, we cannot guarantee (1) for that incompatibility after backtracking. Therefore, the key of conflict resolution is to derive a new incompatibility for which we will be able to guarantee (1). And we have seen how to do that with the rule of resolution. We will derive a new incompatibility called the "prior cause" as the resolvent of the current incompatibility and the incompatibility which is the cause of the satisfier. If necessary, we repeat that procedure until finding an incompatibility, called the "root cause", for which we can guarantee that it will be almost satisfied after backtracking (1).
Now the question is where do we cut? Is there a reason we cut at the previous satisfier decision level? Is it to guarantee (2)? Would that not be guaranteed if we picked another decision level? Is it because backtracking further back will reduce the number of potential conflicts?
Building a report tree
Terminal incompatibility
Whenever we build an incompatibility that will always be satisfied, version solving has failed. The empty incompatibility, for example, is an incompatibility for which the terms of all packages are \( \neg [\varnothing] \) and thus all terms are satisfied. Another terminal incompatibility is an incompatibility containing a single package term, satisfied by picking the package and version at the root of the solution, the one for which we want to resolve dependencies.
Derivation tree to report tree
Incompatibilities are either external or derived. External incompatibilities are the ones expressing facts independent of the solver deduction capabilities, such as dependencies, unavailable dependencies, missing versions etc. In contrast, derived incompatibilities are the ones obtained through the rule of resolution when computing prior causes. Every derived incompatibility keeps a reference to the two incompatibilities that were used to derive it. As a consequence, a chain of derived incompatibilities defines a binary tree, that we call the derivation tree.
When version solving has failed, it is thus possible to take the derivation tree of the terminal incompatibility to build a complete explanation of why it failed. That derivation tree however uses incompatibilities whose shape depends on internal implementation details. The purpose of the report tree is then to transform the derivation tree into a data structure easily usable for reporting.
Report tree type
In order to provide information in the most detailed way, the report tree uses enum types that try to be as self-explanatory as possible. I'm not sure the current design, based on a recursive (boxed) enum, is the best one, but it has the advantage of presenting the binary report tree in a very straightforward manner, easy to process.
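As a rough sketch of that shape (simplified here; the exact definitions live in the report module of the crate):
#![allow(unused)] fn main() {
/// Simplified shape of the recursive reporting type.
pub enum DerivationTree<P, V> {
    /// Leaves are external incompatibilities: dependencies,
    /// missing versions, unavailable dependencies, etc.
    External(External<P, V>),
    /// Nodes are derived incompatibilities with their two causes.
    Derived(Derived<P, V>),
}

pub enum External<P, V> {
    // Ranges simplified to a single version type in this sketch.
    FromDependencyOf(P, V, P, V),
    NoVersions(P, V),
    // ... (other external causes omitted)
}

pub struct Derived<P, V> {
    /// Simplified representation of the terms of the derived incompatibility.
    pub terms: Vec<(P, V)>,
    pub cause1: Box<DerivationTree<P, V>>,
    pub cause2: Box<DerivationTree<P, V>>,
}
}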
Duplicate nodes
Though it has the shape of a binary tree, and can be represented as a binary tree, the derivation tree is actually a derivation graph. Indeed, even if every derived incompatibility was built from two others, some incompatibilities may have been used to derive multiple new incompatibilities.
TODO: ascii representations similar to the ones in the error reporting section of pub solver.md.
There is still much API exploration work to be done on error reporting.
Testing and benchmarking
Any non-trivial software is flawed and it is a programmer's goal to make it correct, readable and fast. Testing helps to get correct programs, and benchmarking helps to make them faster. In this section we present the approach and tools used to test and benchmark pubgrub.
Note on reproducibility:
"Insanity is doing the same thing over and over again, but expecting different results". Einstein probably didn't came up with that one but this is still very much the definition of non-reproducibility, and it can drive us mad when chasing heisenbugs. For this reason we try to avoid everything that would make pubgrub non reproducible, such that every failed test can be reproduced and fixed.
Property testing
To test pubgrub, we employ a mix of unit tests, integration tests and property tests.
Some unit tests are present for example in the version
module to validate
that the parsing of semantic versions is correct.
Integration tests are located in the tests/
directory.
Property tests are co-located both with unit tests and integration tests
depending on required access to some private implementation details.
Examples
We have multiple example cases inside tests/examples.rs
.
Those mainly come from the dart documentation of the solver
and are simple end-to-end examples for which we know the expected results.
The first example called no_conflict
is a simple case where
the root package depends on one package which itself has a dependency
on another package.
The tests there compare the expected result with the solution of the solver
for each of those examples.
These were very useful when making progress on the PubGrub implementation
and can now also be referred to as example usage of the API.
Unit property testing
Property testing consists in defining invariants,
generating random valid inputs for the tested functions,
and verifying that the invariants hold for the results.
Here are some examples extracted from the range
module.
#![allow(unused)] fn main() {
proptest! {
    #[test]
    fn negate_is_different(range in strategy()) {
        assert_ne!(range.negate(), range);
    }

    #[test]
    fn double_negate_is_identity(range in strategy()) {
        assert_eq!(range.negate().negate(), range);
    }

    #[test]
    fn negate_contains_opposite(range in strategy(), version in version_strat()) {
        assert_ne!(range.contains(&version), range.negate().contains(&version));
    }
}
}
As you can see, the input of the testing function is specified in an unusual manner, using a function called a "strategy". This is the terminology used by the proptest crate and it simply is a way of describing how the values used as inputs to those property tests are randomly generated.
Don't hesitate to have a look at the corresponding strategy()
function
defined just above the extracted code if you want to know more about that.
End-to-end property testing
Defining and testing properties for small units of code
like the ranges implementation is rather easy.
Coming up with interesting properties for end-to-end testing
such as results of full resolution is different.
But the most difficult part is probably finding a "strategy"
to generate random but valid registries of packages and dependencies.
This is the work that has been done in tests/proptest.rs
for the registry_strategy()
function.
The simplest implementation of registry_strategy() would be any::<Map<P, Map<V, Map<P, Range<V>>>>>(), with some machinery to convert to an OfflineDependencyProvider.
While this works, it would almost certainly generate unsolvable dependencies.
Indeed, random package names in dependencies have an almost null probability of matching existing package names.
Let's defer picking the dependencies until we have a list of available names.
Luckily proptest has an Index
type for this use case of picking something out of a list that has not yet been generated.
This leaves us with any::<Map<P, Map<V, Map<Index, Range<V>>>>>(),
with more machinery to get the P
for each index.
While this works, the generated versions have a very low probability of matching existing versions for a given package. Let's also use Index to ensure our Range values are relevant.
To identify each Index we will name Ip
the one for package names,
and Ia
and Ib
the ones to define version ranges.
This leaves us with any::<Map<P, Map<V, Map<Ip, (Ia, Ib)>>>>().
The conversion from Map<Ip, (Ia, Ib)>
to Map<P, Range<V>>
is done by first picking
a dependency P
using Ip
and then picking up two versions for that package
with the Ia
and Ib
indices.
We can then call Range::between(versions[Ia], versions[Ib].bump())
to build a range.
If versions[Ib] < versions[Ia]
, we use Range::between(versions[Ib], versions[Ia].bump())
instead.
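In code, that conversion could be sketched as follows (illustrative function, assuming the two indices have already been resolved to positions in the list of versions):
#![allow(unused)] fn main() {
use pubgrub::range::Range;
use pubgrub::version::{NumberVersion, Version};

/// Build a never-empty range from two picked versions,
/// ordering them first so that the lower bound comes first.
fn range_from_indices(versions: &[NumberVersion], ia: usize, ib: usize) -> Range<NumberVersion> {
    let (low, high) = if versions[ia] <= versions[ib] {
        (ia, ib)
    } else {
        (ib, ia)
    };
    Range::between(versions[low].clone(), versions[high].bump())
}
}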
Now we finally have something that makes interesting registries! But not particularly realistic ones, since almost all of them end up with packages that indirectly depend on themselves. PubGrub is designed to solve dependencies with at most one version per package, so that kind of circular dependency will very often end up unsolvable.
To fix this, let's make sure that we have a DAG,
by having each package only depend on packages with a lower index.
One solution is to represent dependencies in a side list Vec<(Ip, Id, (Ia, Ib))>.
If Ip < Id
then we switch them to maintain the acyclic nature of the graph.
This is currently how we generate registries of dependencies,
suggestions to improve are welcome!
Generating random registries of packages may produce cases
where dependency resolution would take too long.
For this reason, we introduced in the DependencyProvider
trait definition
a function called should_cancel
which is called in the solver loop.
By default it does nothing, but it can be overridden, such as
in TimeoutDependencyProvider
(defined in tests/proptest.rs
),
where the solver is stopped after a certain amount of time.
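The idea behind it can be sketched as follows (illustrative names, not the exact test code):
#![allow(unused)] fn main() {
use std::error::Error;
use std::time::{Duration, Instant};

/// Sketch of a provider wrapping another one
/// to stop the solver after a time limit.
struct TimeoutProvider<DP> {
    inner: DP,
    start: Instant,
    limit: Duration,
}

impl<DP> TimeoutProvider<DP> {
    /// should_cancel is called regularly in the solver loop,
    /// so returning an error here aborts the resolution.
    fn should_cancel(&self) -> Result<(), Box<dyn Error>> {
        if self.start.elapsed() > self.limit {
            Err("solver timeout reached".into())
        } else {
            Ok(())
        }
    }
}
}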
Once all this is set up, we have to come up with good properties. Here are some of these:
- The solver should return the same result on multiple runs with the same input. That may seem trivial, but things like hashmaps do add some randomness. So that test ensures that we properly configured everything that could prevent reproducibility of the solver.
- Changing the order in which valid package versions are tried should not change whether a solution exists. Indeed, some freedom is available for the dependency provider to pick which package and version to choose next. We must ensure that it cannot change the existence of a solution for our implementation of the solver algorithm.
- Removing a dependency cannot prevent the existence of a solution. If a solution was found in a given situation, removing a dependency cannot get us into a situation where the solver does not find a solution anymore. Only adding dependencies should impact that.
- Removing a package that does not appear in a solution cannot break that solution. Just as before, it should not impact the existence of a solution.
Comparison with a SAT solver
The SAT problem aims at answering whether there exists a set of Boolean variable assignments satisfying a given Boolean expression. So if we can formulate dependencies as a Boolean expression, we can use these well tested tools to compare the result of pubgrub with that of a SAT solver. SAT solvers usually work with Boolean expressions in conjunctive normal form (CNF), meaning a conjunction of clauses, an "AND of ORs". So our goal is to convert the following rules of dependency solving into a set of clauses.
- There must be only one version per package.
- Dependencies of a package must be present if that package is present.
- The root package must be present in the solution.
Each package version \((p,v)\) is associated to a unique Boolean variable
\(b_{p,v}\) that can be true
or false
.
For a given package \(p\) and its set of existing versions \(V_p\),
the property (1) can be expressed by the set of clauses
\[ \{ \neg b_{p,v_1} \lor \neg b_{p,v_2} \ | \ (v_1, v_2) \in V_p^2, v_1 \neq v_2 \}. \]
Each one of these clauses specifies that \(v_1\) or \(v_2\) is not present
in the solution for package \(p\).
These clauses are generated by the sat_at_most_one
function in the testing code.
A more efficient approach is used when a package has 4 or more versions,
but the idea is the same.
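The naive pairwise encoding can be sketched as follows (illustrative code; literals are nonzero integers and negation is a sign flip, as in the DIMACS convention):
#![allow(unused)] fn main() {
/// Naive pairwise "at most one" encoding:
/// for every pair of variables, emit the clause ¬v1 ∨ ¬v2.
fn at_most_one_pairwise(vars: &[i32], clauses: &mut Vec<Vec<i32>>) {
    for (i, &v1) in vars.iter().enumerate() {
        for &v2 in &vars[i + 1..] {
            clauses.push(vec![-v1, -v2]);
        }
    }
}

// For 3 versions of a package with variables 1, 2, 3 this produces
// the clauses [-1, -2], [-1, -3] and [-2, -3].
let mut clauses = Vec::new();
at_most_one_pairwise(&[1, 2, 3], &mut clauses);
}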
For a given package version \((p,v)\) depending on another package \(d\)
at versions \((v_1, v_2, \cdots, v_k)\), the property (2) can be encoded by the clause
\[ \neg b_{p,v} \lor b_{d,v_1} \lor b_{d,v_2} \lor \cdots \lor b_{d,v_k} \]
specifying that either \((p,v)\) is not present in the solution, or at least one of versions \((v_1, v_2, \cdots, v_k)\) of package \(d\) is present in the solution. Finally, the last constraint of version solving (3) is encoded with the following unit clause,
\[ b_{r,v_r} \]
where \((r, v_r)\) is the root package version for which we are solving dependencies.
Once the conversion to the SAT formulation is done, we can combine it with property testing, and verify the following properties:
- If pubgrub finds a solution, the SAT solver is also satisfied by that solution.
- If pubgrub does not find a solution, neither does the SAT solver.
Benchmarking
Performance optimization is a tedious but very rewarding practice if done right. It requires rigor and sometimes arcane knowledge of low-level details. If you are interested in performance optimization for pubgrub, we suggest first reading The Rust Performance Book.
Side note about code layout and performance
Changing your username has an impact on the performance of your code. This is not clickbait I promise, but simply an effect of layout changes. So before making any assumption about a performance improvement or regression, try making sure that the measurements actually reflect the intent of the code changes and not something else. It was shown that code layout changes can produce performance variations of ±40%.
I highly recommend watching the talk "Performance Matters" by Emery Berger presented at Strange Loop 2019 for more information on "sound performance analysis" and "effective performance profiling". There is an issue in criterion about the "stabilizer" tool. And for causal profiling, it seems possible to directly use coz for rust programs.
How can I contribute? Here are some ideas
- Use it! Indeed there is quite some work left for custom dependency providers. So just using the crate, building your own dependency provider for your favorite programming language and letting us know how it turned out would already be amazing feedback!
- Non-failing extension for multiple versions. Currently, the solver works by allowing only one version per package. In some contexts however, we may want to not fail if multiple versions are required, and instead return multiple versions per package. Such situations may be for example allowing multiple major versions of the same crate. But it could be more general than that exact scenario.