Tuesday, April 19, 2016

Managing the state of mutable components in NixOS configurations with Dysnomia


In an old blog post (and research paper) from a couple of years ago, I described a prototype version of Dysnomia -- a toolset that can be used to deploy so-called "mutable components". In the middle of last year, I integrated the majority of its concepts into the mainstream version of Dysnomia, because I had found some practical use for it.

So far, I have only used Dysnomia in conjunction with Disnix -- Disnix executes all activities required to deploy a service-oriented system, such as:

  • Building services and their intra-dependencies from source code. By default, Disnix performs the builds on the coordinator machine, but can also optionally delegate them to target machines in the network.
  • Distributing services and their intra-dependency closures to the appropriate target machines in the network.
  • Activating newly deployed services, and deactivating obsolete services.
  • Optionally snapshotting, transferring and restoring the state of services (or a subset of services) that have moved from one target machine to another.

For carrying out the building and distribution activities, Disnix invokes the Nix package manager, as it provides a number of powerful features that make deployment of packages more reliable and reproducible.

However, not all activities required to deploy service-oriented systems are supported by Nix and this is where Dysnomia comes in handy -- one of Dysnomia's objectives is to uniformly activate and deactivate mutable components in containers by modifying the latter's state. The other objective is to uniformly support snapshotting and restoring the state of mutable components deployed in a container.

The definitions of mutable components and containers are deliberately left abstract in a Dysnomia context. Basically, they can represent anything, such as:

  • A MySQL database schema component and a MySQL DBMS container.
  • A Java web application component (WAR file) and an Apache Tomcat container.
  • A UNIX process component and a systemd container.
  • Even NixOS configurations can be considered mutable components.

To support many kinds of component and container flavours, Dysnomia has been designed as a plugin system -- each Dysnomia module has a standardized interface (basically a process taking two standard command-line parameters) and implements a set of standard deployment activities (e.g. activate, deactivate, snapshot and restore) for each type of container.

Despite the fact that Dysnomia has originally been designed for use with Disnix (the package was historically known as Disnix activation scripts), it can also be used as a standalone tool or in combination with other deployment solutions. (As a sidenote: the reason why I picked the name Dysnomia is that, like Nix, it is the name of a moon of a Trans-Neptunian object).

Similar to Disnix, when deploying NixOS configurations, all activities to deploy the static parts of a system are carried out by the Nix package manager.

However, in the final step (the activation step) a big generated shell script is executed that is responsible for deploying the dynamic parts of a system, such as updating the GRUB bootloader, reloading systemd units, creating folders that store variable data (e.g. /var), creating user accounts and so on.

In some cases, it may also be desired to deploy mutable components as part of a NixOS system configuration:

  • Some systems are monolithic and cannot be decomposed into services (i.e. distributable units of deployment).
  • Some NixOS modules have scripts that initialize the state of a system service (such as a database) on first startup, but they do it in their own ad hoc way -- there is no real formalism behind it.
  • You may also want to use Dysnomia's (primitive) snapshotting facilities for backup purposes.

Recently, I did some interesting experiments with Dysnomia at NixOS level. In this blog post, I will show how Dysnomia can be used in conjunction with NixOS.

Deploying NixOS configurations


As described in earlier blog posts, in NixOS, deployment is driven by a single NixOS configuration file (/etc/nixos/configuration.nix), such as:

{pkgs, ...}:

{
  boot.loader.grub = {
    enable = true;
    device = "/dev/sda";
  };

  fileSystems."/" = {
    device = "/dev/disk/by-label/nixos";
    fsType = "ext4";  
  };

  services = {
    openssh.enable = true;
    
    mysql = {
      enable = true;
      package = pkgs.mysql;
      rootPassword = ../configurations/mysqlpw;
    };
  };
}

The above configuration file states that we want to deploy a system using the GRUB bootloader, having a single root partition, and running OpenSSH and MySQL as system services. The configuration can be deployed with a single command-line instruction:

$ nixos-rebuild switch

When running the above command-line instruction, the Nix package manager deploys all required packages and configuration files. After all packages have been successfully deployed, the activation script gets executed. As a result, we have a system running OpenSSH and MySQL.

By modifying the above configuration and adding another service after MySQL:

...

mysql = {
  enable = true;
  package = pkgs.mysql;
  rootPassword = ../configurations/mysqlpw;
};

tomcat = {
  enable = true;
  commonLibs = [ "${pkgs.mysql_jdbc}/share/java/mysql-connector-java.jar" ];
  catalinaOpts = "-Xms64m -Xmx256m";
};

...

and running the same command-line instruction again:

$ nixos-rebuild switch

The NixOS configuration gets upgraded to also run Apache Tomcat as a system service, in addition to MySQL and OpenSSH. When upgrading, Nix only builds or downloads the packages that have not been deployed before, making the upgrade process much more efficient than rebuilding the system from scratch.

Managing collections of mutable components


Similar to NixOS configurations (which represent entire system configurations), we need to manage the deployment of the mutable components belonging to a system configuration as a whole. I have developed a new tool called dysnomia-containers for this purpose.

The following command-line instruction queries all available containers on a system that serve as potential deployment targets:

$ dysnomia-containers --query-containers
mysql-database
process
tomcat-webapplication
wrapper

The above command-line instruction searches all folders listed in the DYSNOMIA_CONTAINERS_PATH environment variable (which defaults to /etc/dysnomia/containers) for container configuration files and displays their names, such as mysql-database (corresponding to a MySQL DBMS server), and process and wrapper (virtual containers that integrate with the host system's service manager, such as systemd).

We can also query the available mutable components that we can deploy to the above listed containers:

$ dysnomia-containers --query-available-components
mysql-database/rooms
mysql-database/staff
mysql-database/zipcodes
tomcat-webapplication/GeolocationService
tomcat-webapplication/RoomService
tomcat-webapplication/StaffService
tomcat-webapplication/StaffTracker
tomcat-webapplication/ZipcodeService

The above command-line instruction displays all the available mutable component configurations that reside in directories provided by the DYSNOMIA_COMPONENTS_PATH environment variable, such as three MySQL databases and five Apache Tomcat web applications.

We can deploy all the available mutable components to the available containers, by running:

$ dysnomia-containers --deploy
Activating component: rooms in container: mysql-database
Activating component: staff in container: mysql-database
Activating component: zipcodes in container: mysql-database
Activating component: GeolocationService in container: tomcat-webapplication
Activating component: RoomService in container: tomcat-webapplication
Activating component: StaffService in container: tomcat-webapplication
Activating component: StaffTracker in container: tomcat-webapplication
Activating component: ZipcodeService in container: tomcat-webapplication

Besides displaying the available mutable components and deploying them, we can also query which ones have been deployed already:

$ dysnomia-containers --query-activated-components
mysql-database/rooms
mysql-database/staff
mysql-database/zipcodes
tomcat-webapplication/GeolocationService
tomcat-webapplication/RoomService
tomcat-webapplication/StaffService
tomcat-webapplication/StaffTracker
tomcat-webapplication/ZipcodeService

The dysnomia-containers tool uses the sets of available and activated components to make an upgrade more efficient -- when deploying a new system configuration, it will deactivate the activated components that are no longer available, and activate the available components that have not been activated yet. The components that are present in both the old and the new configuration remain untouched.

For example, if we run dysnomia-containers --deploy again, nothing will be deployed or undeployed, as the configuration has remained identical.

We can also take snapshots of all activated mutable components (for example, for backup purposes):

$ dysnomia-containers --snapshot

After running the above command, the Dysnomia snapshot utility may show you the following output:

$ dysnomia-snapshots --query-all
mysql-database/rooms/faede34f3bf658884020a31ca98f16503da9a90bf3313cc96adc5c2358c0b054
mysql-database/staff/e9af7042064c33379ba9fe9272f61986b5a85de63c57732f067695e499a3a18f
mysql-database/zipcodes/637faa3e79ec6c2db71ac4023e86f29890e54233ea6592680fd88481725d44a3

As may be noticed, for each MySQL database (we have three of them) we have taken a snapshot. (For the Apache Tomcat web applications, no snapshots have been taken because state management for these kinds of components is unsupported).

We can also restore the state from the snapshots that we have just taken:

$ dysnomia-containers --restore

The above command restores the state of all three databases.

Finally, as with services deployed by Disnix, deactivating a mutable component does not imply that its state is removed automatically. Instead, the state is marked as garbage and must be explicitly removed by running:

$ dysnomia-containers --collect-garbage

NixOS integration


To actually make the previously shown deployment activities work, we need configuration files for all the containers and mutable components and put them into locations that are reachable from the DYSNOMIA_CONTAINERS_PATH and DYSNOMIA_COMPONENTS_PATH environment variables.

Obviously, they can be written by hand (as demonstrated in my previous blog post about Dysnomia), but this is not always very practical to do at system level. Moreover, there is some repetition involved, as a NixOS configuration and the container configuration files capture common properties.

I have developed a Dysnomia NixOS module to automate Dysnomia's configuration through NixOS. It can be enabled by adding the following property to a NixOS configuration file:

dysnomia.enable = true;

We can specify container properties in a NixOS configuration file as follows:

dysnomia.containers = {
  mysql-database = {
    mysqlUsername = "root";
    mysqlPassword = "secret";
    mysqlPort = 3306;
  };
  tomcat-webapplication = {
    tomcatPort = 8080;
  };
  ...
};

The Dysnomia module generates a container configuration file for each attribute in the dysnomia.containers set (with the same name as the attribute) and composes its contents from the corresponding sub attribute set by translating it into a text file with key=value pairs.
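
For example, based on the mysql-database attribute set shown above, the generated container configuration file would roughly contain the following key=value pairs (an illustrative sketch; the exact formatting is up to the module):

mysqlUsername=root
mysqlPassword=secret
mysqlPort=3306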

Most of the dysnomia.containers properties can be automatically generated by the Dysnomia NixOS module as well, since most of them have already been specified elsewhere in a NixOS configuration. For example, by enabling MySQL in a Dysnomia-enabled NixOS configuration:

services.mysql = {
  enable = true;
  package = pkgs.mysql;
  rootPassword = ../configurations/mysqlpw;
};

The Dysnomia module automatically generates the corresponding container properties as shown previously. The Dysnomia NixOS module integrates with all NixOS features for which Dysnomia provides a plugin.

In addition to containers, we can also specify the available mutable components as part of a NixOS configuration:

dysnomia.components = {
  mysql-database = {
    rooms = pkgs.writeTextFile {
      name = "rooms";
      text = ''
        create table room
        ( Room     VARCHAR(10)    NOT NULL,
          Zipcode  VARCHAR(6)     NOT NULL,
          PRIMARY KEY(Room)
        );
      '';
    };
    staff = ...
    zipcodes = ...
  };

  tomcat-webapplication = {
    ...
  };
};

As can be observed in the above example, the dysnomia.components attribute set captures the available mutable components per container. For the mysql-database container, we have defined three databases: rooms, staff and zipcodes. Each attribute refers to a Nix build function that produces an SQL file representing the initial state of the database on first activation (typically a schema).

Besides MySQL databases, we can use the tomcat-webapplication attribute to automatically deploy Java web applications to the Apache Tomcat servlet container. The corresponding value of each mutable component refers to the result of a Nix build function that produces a Java web application archive (WAR file).

The Dysnomia module automatically composes a directory with symlinks referring to the generated mutable component configurations reachable through the DYSNOMIA_COMPONENTS_PATH environment variable.
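
For illustration, the composed directory could be structured roughly as follows -- the layout and the store paths are illustrative, not literal output:

mysql-database/
  rooms -> /nix/store/...-rooms
  staff -> /nix/store/...-staff
  zipcodes -> /nix/store/...-zipcodes
tomcat-webapplication/
  GeolocationService -> /nix/store/...-GeolocationService
  ...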

Distributed infrastructure state management


In addition to deploying mutable components belonging to a single NixOS configuration, I have mapped the NixOS-level Dysnomia deployment concepts to networks of NixOS machines by extending the DisnixOS toolset (the Disnix extension integrating Disnix' service deployment concepts with NixOS' infrastructure deployment).

It may not have been stated explicitly in any of my previous blog posts, but DisnixOS can also be used to deploy a network of NixOS configurations to target machines in a network. For example, we can compose a networked NixOS configuration that includes the machine configuration shown previously:

{
  test1 = import ./configurations/mysql-tomcat.nix;
  test2 = import ./configurations/empty.nix;
}

The above configuration file is an attribute set defining two machine configurations. The first attribute (test1) refers to our previous NixOS configuration running MySQL and Apache Tomcat as system services.

We can deploy the networked configuration with the following command-line instruction:

$ disnixos-deploy-network network.nix

As a sidenote: although DisnixOS can deploy networks of NixOS configurations, NixOps does a better job in accomplishing this. Moreover, DisnixOS only supports deployment of NixOS configurations to bare-metal servers and cannot instantiate any VMs in the cloud.

Furthermore, what DisnixOS also does differently compared to NixOps is invoking Dysnomia to activate or deactivate NixOS configurations -- the corresponding NixOS plugin executes the big monolithic NixOS activation script for the activation step and runs nixos-rebuild --rollback switch for the deactivation step.

I have extended Dysnomia's nixos-configuration plugin with state management operations. Snapshotting the state of a NixOS configuration simply means running:

$ dysnomia-containers --snapshot

Likewise, restoring the state of a NixOS configuration can be done with:

$ dysnomia-containers --restore

And removing obsolete state with:

$ dysnomia-containers --collect-garbage

When using Disnix to manage state, we may have mutable components deployed as part of a system configuration and mutable components deployed as services in the same environment. To prevent the snapshots of the services from conflicting with the ones belonging to a machine's system configuration, we set the DYSNOMIA_STATEDIR environment variable to /var/state/dysnomia-nixos for system-level state management and to /var/state/dysnomia for service-level state management, to keep them apart.
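
For illustration, keeping the two state directories apart essentially boils down to invocations along the following lines (a hypothetical manual invocation of the system-level case; the tools take care of this distinction themselves):

$ DYSNOMIA_STATEDIR=/var/state/dysnomia-nixos dysnomia-containers --snapshot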

With these additional operations, we can capture the state of all mutable components that are part of the system configurations in a network:

$ disnixos-snapshot-network network.nix

This yields a snapshot of the test1 machine stored in the Dysnomia snapshot store on the coordinator machine:

$ dysnomia-snapshots --query-latest
nixos-configuration/nixos-system-test1-16.03pre-git/4c4751f10648dfbbf8e25c924391e80913c8a6a600f7b481d73cd88ff3d32730

When inspecting the contents of the NixOS system configuration snapshot, we will observe:

$ cd /var/state/dysnomia/snapshots/$(dysnomia-snapshots --query-latest)
$ find -maxdepth 3 -mindepth 3 -type d
./mysql-database/rooms/faede34f3bf658884020a31ca98f16503da9a90bf3313cc96adc5c2358c0b054
./mysql-database/staff/e9af7042064c33379ba9fe9272f61986b5a85de63c57732f067695e499a3a18f
./mysql-database/zipcodes/637faa3e79ec6c2db71ac4023e86f29890e54233ea6592680fd88481725d44a3

The NixOS system configuration snapshot consists of the snapshots of all mutable components belonging to that system configuration.

Similar to restoring the state of individual mutable components, we can restore the state of all mutable components that are part of a system configuration in a network of machines:

$ disnixos-restore-network network.nix

And remove their obsolete state, by running:

$ disnixos-delete-network-state network.nix

TL;DR: Discussion


In this blog post, I have described an extension to Dysnomia that makes it possible to manage the state of mutable components belonging to a system configuration, and a NixOS module making it possible to automatically configure Dysnomia from a NixOS configuration file.

This new extension makes it possible to deploy mutable components belonging to systems that cannot be divided into distributable deployment units (or services in a Disnix-context), such as monolithic system configurations.

To summarize: if it is desired to manage the state of mutable components in a NixOS configuration, you need to provide a number of additional configuration settings. First, we must enable Dysnomia:

dysnomia.enable = true;

Then enable a number of container services, such as MySQL:

services.mysql.enable = true;

(As explained earlier, the Dysnomia module will automatically generate its corresponding container properties).

Finally, we can specify a number of available mutable components that can be deployed automatically, such as a MySQL database:

dysnomia.components = {
  mysql-database = {
    rooms = pkgs.writeTextFile {
      name = "rooms";
      text = ''
        create table room
        ( Room     VARCHAR(10)    NOT NULL,
          Zipcode  VARCHAR(6)     NOT NULL,
          PRIMARY KEY(Room)
        );
      '';
    };
  };
};

After deploying a Dysnomia-enabled NixOS system configuration through:

$ nixos-rebuild switch

we can deploy the mutable components belonging to it by running:

$ dysnomia-containers --deploy

Unfortunately, managing mutable components at system level also has a huge drawback, in particular in distributed environments. Snapshots of entire system configurations are typically too coarse -- whenever the state of any of the mutable components changes, a new system-level composite snapshot is generated that is composed of the snapshots of all mutable components.

Typically, these snapshots contain redundant data that is not shared among snapshot generations (although there are potential solutions to cope with this, I have not implemented any optimizations yet). As explained in my previous Dysnomia-related blog posts, snapshotting individual components (such as large databases) can already be quite expensive, and these costs may become significantly larger at system level.

Likewise, restoring state at system level implies that the state of all mutable components will be restored. This is also typically undesired, as it may be too destructive and time-consuming. Moreover, moving the state from one machine to another when a mutable component gets migrated is also much more expensive.

For more control and more efficient deployment of mutable components, it would typically be better to develop a Disnix service-model so that they can be managed individually.

Because of these drawbacks, I am not prominently advertising DisnixOS' distributed state management features. Moreover, I also did not attempt to integrate these features into NixOps, for the same reasons.

References


The dysnomia-containers tool as well as the distributed infrastructure management facilities have been integrated into the development versions of Dysnomia and DisnixOS, and will become part of the next Disnix release.

I have also added a sub example to the Java version of the Disnix staff tracker example to demonstrate how these features can be used.

As a final note, the Dysnomia NixOS module has not yet been integrated in NixOS. Instead, the module must be imported from a Dysnomia Git clone, by adding the following line to a NixOS configuration file:

imports = [ /home/sander/dysnomia/dysnomia-module.nix ];

Thursday, March 17, 2016

The NixOS project and deploying systems declaratively


Last weekend I was in Wrocław, Poland to attend wroc_love.rb, a conference tailored towards (but not restricted to) Ruby-related applications. The reason for me to go there was that I was invited to give a talk about NixOS.

As I had never visited Poland nor attended a Ruby-related conference before, I did not really know what to expect, but it turned out to be a nice experience. The city, venue and people were all quite interesting, and I liked it very much.

In my talk I basically had two objectives: providing a brief introduction to NixOS and diving into one of its underlying visions: declarative deployment. From my perspective, the former aspect is not particularly new as I have given talks about the NixOS project many times (for example, I also crafted three explanation recipes).

Something that I have not done before is diving into the latter aspect. In this blog post, I'd like to elaborate on it, discuss why it is appealing, and explore to what extent certain tools achieve it.

On being declarative


I have used the word declarative in many of my articles. What is it supposed to mean?

I have found a nice presentation online that elaborates on four kinds of sentences in linguistics. One of the categories covered in the slides is declarative sentences, which (according to the presentation) can be defined as:

A declarative sentence makes a statement. It is punctuated by a period.

As an example, the presentation shows:

The dog in the neighbor's yard is barking.

Another class of sentences that the presentation describes are imperative sentences which it defines as:

An imperative sentence is a command or polite request. It ends in a period or exclamation mark.

The following xkcd comic shows an example:


(Besides these two categories of sentences described earlier, the presentation also covers interrogative sentences and exclamatory sentences, but I won't go into detail on that).

On being declarative in programming


In linguistics, the distinction between declarative and imperative sentences is IMO mostly clear -- declarative sentences state facts and imperative sentences are commands or requests.

A similar distinction exists in programming as well. For example, on Wikipedia I found the following definition for declarative programming (the Wikipedia article cites the article: "Practical Advantages of Declarative Programming" written by J.W. Lloyd, which I unfortunately could not find anywhere online):

In computer science, declarative programming is a programming paradigm -- a style of building the structure and elements of computer programs -- that expresses the logic of a computation without describing its control flow.

Imperative programming is sometimes seen as the opposite of declarative programming, but not everybody agrees. I found an interesting discussion blog post written by William Cook that elaborates on their differences.

His understanding of the declarative and imperative definitions are:

Declarative: describing "what" is to be computed rather than "how" to compute the result/behavior

Imperative: a description of a computation that involves implicit effects, usually mutable state and input/output.

Moreover, he says the following:

I agree with those who say that "declarative" is a spectrum. For example, some people say that Haskell is a declarative language, but in my view Haskell programs are very much about *how* to compute a result.

I also agree with William Cook's opinion that declarative is a spectrum -- contrary to linguistics, it is hard to draw a hard line between what and how in programming. Some programming languages that are considered imperative, e.g. C, modify mutable state such as variables:

int a = 5;
a += 3;

But if we were to modify the code to work without mutable state, it would still remain more of a "how" description than a "what" description IMO:

int sum(int a, int b)
{
    return a + b;
}

int result = sum(5, 3);

Two prominent languages that are more about what than how are HTML and CSS. Both technologies empower the web. For example, in HTML I can express the structure of a page:

<!DOCTYPE html>

<html>
    <head>
        <title>Test</title>
        <link rel="stylesheet" href="style.css" type="text/css">
    </head>
    <body>
        <div id="outer">
            <div id="inner">
                <p>HTML and CSS are declarative and so cool!</p>
            </div>
        </div>
    </body>
</html>

In the above code fragment, I define two nested divisions in which a paragraph of text is displayed.

In CSS, I can specify the style of these page elements:

#outer {
    margin-left: auto;
    margin-right: auto;
    width: 20%;
    border-style: solid;
}

#inner {
    width: 500px;
}

In the above example, we state that the outer div should be centered, have a width of 20% of the page, and a solid border should be drawn around it. The inner div has a width of 500 pixels.

This approach can be considered declarative, because you do not have to specify how to render the page and the style of the elements (e.g. the text, the border). Instead, this is what the browser's layout engine figures out. Besides being responsible for rendering, it has a number of additional benefits as well, such as:

  • Because it does not matter (much) how a page is rendered, we can fully utilize a system's resources (e.g. a GPU) to render a page in a faster and fancier way, and optionally degrade a page's appearance if a system's resources are limited.
  • We can also interpret the page in many ways. For example, we can pass the text in paragraphs to a text-to-speech engine, for people who are visually impaired.

Despite listing some potential advantages, HTML and CSS are not perfect at all. If you would actually check how the example gets rendered in your browser, then you will observe one of CSS's many odd traits, but I am not going to reveal what it is. :-)

Moreover, despite being more declarative (than code written in an imperative programming language such as C), even HTML and CSS can sometimes be considered a "how" specification. For example, you may want to render a photo gallery on your web page. There is nothing in HTML and CSS that allows you to concisely express that. Instead, you need to decompose it into "lower level" page elements, such as paragraphs, hyperlinks, forms and images.
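
For example, a small photo gallery would have to be decomposed into such lower level elements by hand (a sketch; the file names are made up):

<div id="gallery">
    <a href="photo1-large.jpg"><img src="photo1-thumb.jpg" alt="Photo 1"></a>
    <a href="photo2-large.jpg"><img src="photo2-thumb.jpg" alt="Photo 2"></a>
</div>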

So IMO, being declarative depends on what your goal is -- in some contexts you can exactly express what you want, but in others you can only express things that are in service of something else.

On being declarative in deployment


In addition to development, you eventually have to deploy a system (typically to a production environment) to make it available to end users. To deploy a system you must carry out a number of activities, such as:

  • Building (if a compiled language is used, such as Java).
  • Packaging (e.g. into a JAR file).
  • Distributing (transferring artifacts to the production machines).
  • Activating (e.g. a Java web application in a Servlet container).
  • In case of an upgrade: deactivating obsolete components.

Deployment is often much more complicated than most people expect. Some things that make it complicated are:

  • Many kinds of steps need to be executed, in particular when the technology used is diverse. Without any automation, it becomes extra complicated and time consuming.
  • Deployment in production must typically be done on a large scale. In development, a web application/web service typically serves only one user (the developer), while in production it may need to serve thousands or millions of users. In order to serve many users, you need to manage a cluster of machines having complex constraints in terms of system resources and connectivity.
  • There are non-functional requirements that must be met. For example, while upgrading you want to minimize a system's downtime as much as possible. You probably also want to roll back to a previous version if an upgrade went wrong. Accomplishing these properties is often much more complicated than expected (sometimes even impossible!).

As with linguistics and programming, I see a similar distinction in deployment as well -- carrying out the above-listed activities is simply the means to accomplish deployment.

What I want (if I need to deploy) is that my system on my development machine becomes available in production, while meeting certain quality attributes of the system that is being deployed (e.g. it could serve thousands of users) and quality attributes of the deployment process itself (e.g. that I can easily roll back in case of an error).

Mainstream solutions: convergent deployment


There are a variety of configuration management tools claiming to support declarative deployment. The most well-known category of tools implement convergent deployment, such as: CFEngine, Puppet, Chef, Ansible.

For example, Chef is driven by declarative deployment specifications (implemented in a Ruby DSL) that may look as follows (I took this example from a Chef tutorial):

...

wordpress_latest = Chef::Config[:file_cache_path] + "/wordpress-latest.tar.gz"

remote_file wordpress_latest do
  source "http://wordpress.org/latest.tar.gz"
  mode "0644"
end

directory node["phpapp"]["path"] do
  owner "root"
  group "root"
  mode "0755"
  action :create
  recursive true
end

execute "untar-wordpress" do
  cwd node['phpapp']['path']
  command "tar --strip-components 1 -xzf " + wordpress_latest
  creates node['phpapp']['path'] + "/wp-settings.php"
end

The objective of the example shown above is deploying a Wordpress web application. The specification defines a tarball that must be fetched from the Wordpress web site, a directory that must be created in which the web application is hosted, and the tarball that needs to be extracted into that directory.

The specification can be considered declarative, because you do not have to describe the exact steps that need to be executed. Instead, the specification captures the intended outcome of a set of changes and the deployment system converges to the outcome. For example, for the directory that needs to be created, it first checks if it already exists. If so, it will not be created again. It also checks whether it can be created, before attempting to do it.

Converging, instead of directly executing steps, provides additional safety mechanisms and makes deployment processes more efficient as duplicate work is avoided as much as possible.

There are also a number of drawbacks -- it is not guaranteed (in case of an upgrade) that the system can converge to a new set of outcomes. Moreover, while upgrading a system we may observe downtime (e.g. when a new version of Wordpress is being unpacked). Also, rolling back to a previous configuration cannot be done instantly.

Finally, convergent deployment specifications do not guarantee reproducible deployment. For example, the above code does not capture the configuration process of a web server and a PHP extension module, which are required dependencies to run Wordpress. If we were to apply the changes to a machine where these components are missing, the changes may still apply, but yield a non-working configuration.

The NixOS approach


NixOS also supports declarative deployment, but in a different way. The following code fragment is an example of a NixOS configuration:

{pkgs, ...}:

{
  boot.loader.grub.device = "/dev/sda";

  fileSystems = [ { mountPoint = "/"; device = "/dev/sda2"; } ];
  swapDevices = [ { device = "/dev/sda1"; } ];
  
  services = {
    openssh.enable = true;
    
    xserver = {
      enable = true;
      desktopManager.kde4.enable = true;
    };
  };
  
  environment.systemPackages = [ pkgs.mc pkgs.firefox ];
}

In a NixOS configuration you describe what components constitute a system, rather than the outcome of changes:

  • The GRUB bootloader should be installed on the MBR of partition: /dev/sda.
  • The /dev/sda2 partition should be mounted as a root partition, /dev/sda1 should be mounted as a swap partition.
  • We want Mozilla Firefox and Midnight Commander as end user packages.
  • We want to use the KDE 4.x desktop.
  • We want to run OpenSSH as a system service.

The entire machine configuration can be deployed by running a single command-line instruction:

$ nixos-rebuild switch

NixOS executes all required deployment steps to deploy the machine configuration -- it downloads or builds all required packages from source code (including all their dependencies), it generates the required configuration files, and finally (if all the previous steps have succeeded) it activates the new configuration, including starting the new system services (and deactivating the system services that have become obsolete).

Besides executing the required deployment activities, NixOS has a number of important quality attributes as well:

  • Reliability. Nix (the underlying package manager) ensures that all dependencies are present. It stores new versions of packages next to old versions, without overwriting them. As a result, you can always switch back to older versions if needed.
  • Reproducibility. Undeclared dependencies do not influence builds -- if a build works on one machine, then it works on others as well.
  • Efficiency. Nix only deploys packages and configuration files that are needed.

NixOS is a Linux distribution, but the NixOS project provides other tools bringing the same (or similar) deployment properties to other areas. Nix works on package level (and works on other systems besides NixOS, such as conventional Linux distributions and Mac OS X), NixOps deploys networks of NixOS machines and Disnix deploys (micro)services in networks of machines.

The Nix way of deploying is typically my preferred approach, but these tools also have their limits -- to benefit from the quality properties they provide, everything must be deployed with Nix (and as a consequence: specified in Nix expressions). You cannot take an existing system (deployed by other means) as a starting point and change it later, something that you can actually do with convergent deployment tools, such as Chef.

Moreover, Nix (and its subprojects) only manages the static parts of a system, such as packages and configuration files (which Nix makes immutable by making them read-only), but not any state, such as databases.

For managing state, external solutions must be used. For example, I developed a tool called Dysnomia with similar semantics to Nix, but it is not always a good solution, especially for big chunks of state.

How declarative are these deployment solutions?


I have heard some people claiming that the convergent deployment models are not declarative at all, and the Nix deployment models are actually declarative because they do not specify imperative changes.

Again, I think it depends on how you look at it -- basically, the Nix tools solve problems in a technical domain from declarative specifications, e.g. Nix deploys packages, NixOS entire machine configurations, NixOps networks of machines etc., but typically you would do these kinds of things to accomplish something else, so in a sense you could still consider these approaches a "how" rather than a "what".

I have also developed domain-specific deployment tools on top of the tools that are part of the Nix project, allowing me to concisely express what I want in a specific domain:

WebDSL


WebDSL is a domain-specific language for developing web applications with a rich data model, supporting features such as domain modelling, user interfaces and access control. The WebDSL compiler produces Java web applications.

In order to deploy a WebDSL application in a production environment, all kinds of complicated tasks need to be carried out -- we must install a MySQL server and an Apache Tomcat server, deploy the web application to the Tomcat server, tune specific settings, and install a reverse proxy that does caching, etc.

You typically do not want to express such things in a deployment model. I have developed a tool called webdsldeploy that allows someone to express only the deployment properties that matter for WebDSL applications at a high level. Underneath, the tool consults NixOps (formerly known as Charon) to compose system configurations hosting the components required to run the WebDSL application.

Conference compass


Conference Compass sells services to conference organizers. The most visible part of their service is the apps for conference attendees, providing features such as displaying the conference program, the list of speakers, and floor maps of the venue.

Each customer basically gets "their own app" -- an app for a specific customer has their preferred colors, artwork, content etc. We use a single code base to produce these specialized apps.

To produce such specialized apps, we do not want to specify things such as how to build an app for Android through Nix, an app for iOS through Nix, and how to produce debug and release versions etc. These are basically just technical details.

Instead, we have developed our own custom tool that is driven by a specification that concisely expresses what customizations we want (e.g. artwork) and produces the artefacts we want accordingly.

We use a similar approach for our backends -- each app connects to its own dedicated backend allowing users to configure the content displayed in the app. The configurator can also be used to dynamically update the content that is displayed in the apps. For big customers, we offer an additional service in which we develop programs that automatically import data from their information systems.

For the deployment of these backend instances, we do not want to express things such as machines, database services, and the deployment of NPM and Python packages.

Instead, we use a domain-specific tool that is driven by a model that concisely expresses what configurators we want and which third party integrations they provide. The tool is responsible for instantiating virtual machines in the cloud and deploying the services to it.

Conclusion


In this blog post I have elaborated on being declarative in deployment and discussed to what extent certain tools achieve it. As with declarative programming, being declarative in deployment is a spectrum.

References


Some aspects discussed in this blog post are covered in my PhD thesis:
  • I did a more elaborate comparison of infrastructure deployment solutions in Chapter 6. I also cover convergent deployment and used CFEngine as an example.
  • I have covered webdsldeploy in Chapter 11, including some background information about WebDSL and its deployment aspects.
  • The overall objective of my PhD thesis is constructing deployment tools for specific domains. Most of the chapters cover the ingredients to do so, but Chapter 3 explains a reference architecture for deployment tools, having similar (or comparable) properties to tools in the Nix project.

For convenience, I have also embedded the slides of my presentation into this web page.

Monday, February 29, 2016

Managing NPM flat module installations in a Nix build environment

Some time ago, I reengineered npm2nix and described some of its underlying concepts in a blog post. In the reengineered version, I ported the implementation from CoffeeScript to JavaScript, refactored/modularized the code, and improved the implementation to more accurately simulate NPM's dependency organization, including many of its odd traits.

I have observed that in the latest Node.js (the 5.x series) NPM's behaviour has changed significantly. To cope with this, I did yet another major reengineering effort. In this blog post, I will describe the path that has led to the latest implementation.

The first attempt


Getting a few commonly used NPM packages deployed with Nix is not particularly challenging, but to make it work completely right turns out to be quite difficult -- the early npm2nix implementations generated Nix expressions that build every package and all of its dependencies in separate derivations (in other words: each package and dependency translates to a separate Nix store path). To allow a package to find its dependencies, the build script creates a node_modules/ sub folder containing symlinks that refer to the Nix store paths of the packages that it requires.

NPM packages have loose dependency specifiers, e.g. wildcards and version ranges, whereas Nix package dependencies are exact, i.e. they bind to packages that are identified by unique hash codes derived from all build-time dependencies. npm2nix makes this translation by "snapshotting" the latest conforming version and turning that into a Nix package.

For example, one of my own software projects (NiJS) has the following package configuration file:

{
  "name" : "nijs",
  "version" : "0.0.23",
  "dependencies" : {
    "optparse" : ">= 1.0.3",
    "slasp": "0.0.4"
  }
  ...
}

The package configuration states that it requires optparse version 1.0.3 or higher, and slasp version 0.0.4. Running npm install results in the following directory structure of dependencies:

nijs/
  ...
  package.json
  node_modules/
    optparse/
      package.json
      ...
    slasp/
      package.json
      ...

A node_modules/ folder gets created in which each sub directory represents an NPM package that is a dependency of NiJS. In the older versions of npm2nix, it gets translated as follows:

/nix/store/ab12pq...-nijs-0.0.23/
  ...
  package.json
  node_modules/
    optparse -> /nix/store/4pq1db...-optparse-1.0.5
    slasp -> /nix/store/8j12qp...-slasp-0.0.4
/nix/store/4pq1db...-optparse-1.0.5/
  ...
  package.json
/nix/store/8j12qp...-slasp-0.0.4/
  ...
  package.json

Each involved package is stored in its own private folder in the Nix store. The NiJS package has a node_modules/ folder containing symlinks to its dependencies. For many packages, this approach works well enough, as it at least provides a conforming version for each dependency that it requires.

Unfortunately, it is possible to run into oddities as well. For example, a package that does not work properly in such a model is ironhorse.

For example, we could declare mongoose and ironhorse as dependencies of a project:

{
  "name": "myproject",
  "version": "0.0.1",
  "dependencies": {
    "mongoose": "3.8.5",
    "ironhorse": "0.0.11"
  }
}

Ironhorse has an overlapping dependency with the project's dependencies -- it also depends on mongoose, as shown in the following package configuration:

{
  "name": "ironhorse",
  "version": "0.0.11",
  "license" : "MIT",
  "dependencies" : {
    "underscore": "~1.5.2",
    "mongoose": "*",
    "temp": "*",
    ...
  },
  ...
}

Running 'npm install' on project level yields the following directory structure:

myproject/
  ...
  package.json
  node_modules/
    mongoose/
      ...
    ironhorse/
      ...
      package.json
      node_modules/
        underscore/
        temp/

Note that mongoose only appears once in the hierarchy of node_modules/ folders, despite the fact that it has been declared as a dependency twice.

In contrast, when using an older version of npm2nix, the following directory structure gets generated:

/nix/store/67ab07...-myproject-0.0.1
  ...
  package.json
  node_modules/
    mongoose -> /nix/store/ec704c...-mongoose-3.8.5
    ironhorse -> /nix/store/3ee85e...-ironhorse-0.0.11
/nix/store/3ee85e...-ironhorse-0.0.11
  ...
  package.json
  node_modules/
    underscore -> /nix/store/10af96...-underscore-1.5.2
    mongoose -> /nix/store/a37f75...-mongoose-4.4.5
    temp -> /nix/store/fae379...-temp-0.8.3
/nix/store/ec704c...-mongoose-3.8.5
  package.json
  ...
/nix/store/a37f75...-mongoose-4.4.5
  package.json
  ...
/nix/store/10af96...-underscore-1.5.2
/nix/store/fae379...-temp-0.8.3

In the above directory structure, we can observe that two different versions of mongoose have been deployed -- version 3.8.5 (as a dependency for the project) and version 4.4.5 (as a dependency for ironhorse). Having two different versions of mongoose deployed typically leads to problems.

The reason why npm2nix produces a different result is that whenever NPM encounters a dependency specification, it recursively searches the parent directories to find a conforming version. If a conforming version is found that fits within the version range of a package dependency, it will not be included again. This is also the reason why NPM can "handle" cyclic dependencies (despite the fact that they are a bad practice) -- when a dependency is encountered a second time, it will not be deployed again, causing NPM to break the cycle.

npm2nix did not implement this kind of behaviour -- it always binds a dependency to the latest conforming version, but as can be observed in the last example, this is not what NPM always does -- it could also bind to a shared dependency that may be older than the latest version in the NPM registry. (As a sidenote: I wonder how many NPM users actually know about this detail!)

Second attempt: simulating shared dependency behaviour


One of the main objectives of the reengineered version (as described in my previous blog post) is to more accurately mimic NPM's shared dependency behaviour, as the old behaviour was particularly problematic for packages having cyclic dependencies -- Nix does not allow them, causing the evaluation of the entire Nixpkgs set on the Hydra build server to fail.

The reengineered version worked, but the solution was quite expensive and controversial -- I compose Nix expressions of all packages involved, in which each dependency resolves to the latest corresponding version.

Each time a package includes a dependency, I propagate an attribute set to its build function telling it which dependencies have already been resolved by any of the parents. Resolved dependencies get excluded as a dependency.

To check whether a resolved dependency fits within a version range specifier, I have to consult semver. Because semver is unsupported in the Nix expression language, I use a trick in which I import Nix expressions generated by a build script (that invokes the semver command-line utility) to figure out which dependencies have been resolved already.
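
For illustration, such a check essentially asks semver whether an already resolved version satisfies the requested range, for example (output shown for a matching version):

$ semver -r '>= 1.0.3' 1.0.5
1.0.5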

Besides consulting semver, I used another hack -- packages that have been resolved by any of the parents must be excluded as a dependency. However, NPM packages in Nix are deployed independently from each other in separate build functions, and builds will fail because NPM expects the excluded packages to be present. To solve this problem, I create shims for the excluded packages by substituting them with empty packages with the same name and version, and removing them after the package has been built.

Symlinking the dependencies also no longer worked reliably -- the CommonJS module system first dereferences the location of the includer and looks for shared dependencies in the parent directories relative to that location. This means that, in case of a symlink, it incorrectly resolves to a Nix store path that has no meaningful parent directories. The only solution I could think of is copying dependencies instead of symlinking them.

To summarize: the new solution worked more accurately than the original version (and can cope with cyclic dependencies) but it is quite inefficient as well -- making copies of dependencies causes a lot of duplication (that would be a waste of disk space) and building Nix expressions in the instantiation phase makes the process quite slow.

Third attempt: computing the dependency graph ahead of time


Apart from the inefficiencies described earlier, the main reason that I had to do yet another major revision is that Node.js 5.x (which includes npm 3.x) executes so-called "flat module installations". The idea is that when a package includes a dependency, it will be stored in a node_modules/ folder as high in the directory structure as possible, without breaking any dependencies.

This new approach has a number of implications. For example, deploying the Disnix virtual hosts test web application with the old npm 2.x used to yield the following directory structure:

webapp/
  ...
  package.json
  node_modules/
    express/
      ...
      package.json
      node_modules/
        accepts/
        array-flatten/
        content-disposition/
        ...
    ejs/
      ...
      package.json

As can be observed in the structure above, the test webapp depends on two packages: express and ejs. Express has dependencies of its own, such as accepts, array-flatten, content-disposition. Because no parent node_modules/ folder provides them, they are included privately for the express package.

Running 'npm install' with the new npm 3.x yields the following directory structure:

webapp/
  ...
  package.json
  node_modules/
    accepts/
    array-flatten/
    content-disposition/
    express/
      ...
      package.json
    ejs/
      ...
      package.json

Since the libraries that express requires do not conflict with the includer's dependencies, they have been moved one level up to the parent package's node_modules/ folder.

Flattening the directory structure makes deploying an NPM project even more imperative -- previously, the dependencies that were included with a package already depended on the state of the includer; now we must also modify the entire directory hierarchy of dependencies by moving packages up in the directory structure. It also makes the resulting dependency graph less predictable. For example, the order in which dependencies are installed matters -- unless all dependencies are discarded and reinstalled from scratch, it may result in different kinds of dependency graphs.

If this flat module approach has all kinds of oddities, why would NPM use such an approach, you may wonder? It turns out that the only reason is better Windows support. On Windows, there is a limit on the length of paths, and flattening the directory structure helps to prevent hitting it. Unfortunately, it comes at the price of making deployments more imperative and less predictable.

To simulate this flattening strategy, I had to revise npm2nix again. Because of its previous drawbacks and the fact that we have to perform even more imperative operations, I have decided to implement a new strategy in which I compute the entire dependency graph ahead of time by the generator, instead of hacking it into the evaluation phase of the Nix expressions.

Supporting private and shared dependencies works exactly the same as in the old implementation, but is now performed ahead of time. Additionally, I simulate the flat dependency structure as follows (a simplified code sketch follows this list):

  • When a package requires a dependency: I check whether the parent directory has a conflicting dependency. This means: it either has a dependency bundled with the same name and a different version or indirectly binds to another parent that provides a conflicting version.
  • If the dependency conflicts: bundle the dependency in the current package.
  • If the dependency does not conflict: bind the package to the dependency (but do not include it) and consult the parent package one level higher.
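
The following code fragment sketches this decision procedure in JavaScript. It is a strong simplification and not the actual npm2nix code: the package representation (the parent links and the bundles/resolves maps), the helper names, and the omission of indirect parent bindings are all hypothetical.

var semver = require('semver');

/*
 * Checks whether a package already bundles (or resolves) a dependency with
 * the same name, but an incompatible version. Indirect bindings to other
 * parents are not taken into account in this simplified sketch.
 */
function conflicts(pkg, name, versionSpec) {
    var existing = pkg.bundles[name] || pkg.resolves[name];
    return existing !== undefined && !semver.satisfies(existing.version, versionSpec);
}

/*
 * Decides in which node_modules/ folder a required dependency ends up:
 * keep moving it up as long as the parent does not conflict, otherwise
 * bundle it privately in the current package. Reusing an already bundled
 * compatible version is omitted for brevity.
 */
function bundleDependency(requirer, name, versionSpec, dependency) {
    var current = requirer;

    while (current.parent && !conflicts(current.parent, name, versionSpec)) {
        current.resolves[name] = dependency; /* bind, but do not include */
        current = current.parent;
    }

    current.bundles[name] = dependency; /* include it in this node_modules/ */
}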

Besides computing the dependency graph ahead of time, I also deploy the entire dependency graph in one Nix build function -- because including dependencies is stateful, it no longer makes sense to build them as individual Nix packages, that are supposed to be pure.

I have made the flattening algorithm optional. By default, the new npm2nix generates Nix expressions for the Node.js 4.x release (using the old npm 2.x):

$ npm2nix

By appending the -5 parameter, it generates Nix expressions for usage with Node.js 5.x (using the new npm 3.x with flat module installations):

$ npm2nix -5

I have tested the new approach on many packages including my public projects. The good news is: they all seem to work!

Unfortunately, despite the fact that I could get many packages working, the approach is not perfect and hard to get 100% right. For example, in a private project I encountered bundled dependencies (dependencies that are statically included with a package). NPM also moves them up, while npm2nix merely generates an expression composing the dependency graph (that reflects flat module installations as much as possible). To fix this issue, we must also run a post-processing step that moves up the dependencies that are in the wrong places. Currently, this step has not been implemented in npm2nix yet.

Another issue is that we want Nix to obtain all dependencies instead of NPM. To prevent NPM from consulting external resources, we substitute some version specifiers (such as Git repositories) by a wildcard: *. These version specifiers sometimes confuse NPM, despite the fact that the directory structure matches NPM's dependency structure.

To cope with these imperfections, I have also added an option to npm2nix that prevents it from running npm install -- in many cases, packages still work fine despite NPM being confused. Moreover, the npm install step in the Nix builder environment merely serves as a validation step -- the Nix builder script is responsible for actually providing the dependencies.

Discussion


In this blog post, I have described the path that has lead to a second reengineered version of npm2nix. The new version computes dependency graphs ahead of time and can mostly handle npm 3.x's flat module installations. Moreover, compared to the previous version, it does no longer rely on very expensive and ugly hacks.

Despite the fact that I can now more or less handle flat installations, I am still not quite happy. Some things that bug me are:

  • The habit of "reusing" modules that have been bundled with any of the includers makes it IMO difficult and counter-intuitive to predict which version will actually be used in a certain context. In some cases, packagers might expect that the latest version of a version range will be used, but this is not guaranteed to be the case. This could, for example, reintroduce security and stability issues without end users noticing (or expecting) it.
  • Flat module installations are less deterministic and make it really difficult to predict what a dependency graph looks like -- the dependencies that appear at a certain level in the directory structure depend on the order in which dependencies are installed. Therefore, I do not consider this an improvement over npm 2.x.

Because of these drawbacks, I expect that NPM will reconsider some of its concepts again in the future, causing npm2nix to break again.

I would recommend the NPM developers to use the following approach:

  • All involved packages should be stored in a single node_modules/ folder instead of multiple nested hierarchies of node_modules/ folders.
  • When a module requests another module, the module loader should consult the package.json configuration file of the package where the includer module belongs to. It should take the latest conforming version in the central node_modules/ folder. I consider taking the last version of a version range to be less counter-intuitive than taking any conforming version.
  • To be able to store multiple versions of packages in a single node_modules/ folder, a better directory naming convention should be adopted. Currently, NPM only identifies modules by name in a node_modules/ folder, making it impossible to store two versions next to each other in one directory.

    If they would, for example, use both the name and the version number in the directory names, more things become possible. Adding more properties to the path names makes sharing even better -- for example, a package with a given name and version number could originate from various sources, e.g. the NPM registry or a Git repository -- and reflecting this in the path makes it possible to store more variants next to each other in a reliable way (see the illustration after this list).

    Naming things to improve shareability is not really rocket science -- Nix uses hash codes (that are derived from all build-time dependencies) to uniquely identify packages and the .NET Global Assembly Cache uses so-called strong names that include various naming attributes, such as cryptographic keys to ensure that no library conflicts. I am convinced that adopting a better naming convention for storing NPM packages would be quite beneficial as well.
  • To cope with cyclic dependencies: I would simply say that it suffices to disallow them. Packages are supposed to be units of reuse, and if two packages mutually depend on each other, then they should be combined into one package.
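
To make the module resolution recommendation above a bit more concrete, the following code fragment sketches the proposed lookup strategy. It is only an illustration of the idea (not how NPM actually works today), assuming a hypothetical centralModules table that maps package names to the versions stored in the central node_modules/ folder, and using the semver package to match version ranges:

var semver = require('semver');

/*
 * Hypothetical resolution strategy: given the package.json of the includer and a
 * table of centrally stored versions, always pick the *latest* version that
 * satisfies the includer's version range, so that the outcome is predictable.
 */
function resolveModule(requestedName, includerPackageJson, centralModules) {
    var range = includerPackageJson.dependencies[requestedName];
    var candidates = centralModules[requestedName] || [];

    var conforming = candidates.filter(function(version) {
        return semver.satisfies(version, range);
    });

    /* Sort from newest to oldest and take the first one */
    return conforming.sort(semver.rcompare)[0];
}

For example, resolveModule("underscore", { dependencies: { underscore: "1.x" } }, { underscore: ["1.4.4", "1.8.3"] }) would always yield "1.8.3".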

Availability


The second reengineered npm2nix version can be obtained from my GitHub page. The code resides in the reengineering2 branch.

Thursday, January 28, 2016

Disnix 0.5 release announcement and some reflection

In this blog post, I'd like to announce the next Disnix release. At the same time, I noticed that it has been eight years since I started developing it, so this would also be a nice opportunity to do some reflection.

Some background information


The idea was born while I was working on my master's thesis. A few months prior, I got familiar with Nix and NixOS -- I read Eelco Dolstra's PhD thesis, managed to package some software, and wrote a couple of services for NixOS.

Most of my packaging work was done to automate the deployment of WebDSL applications, a case study in domain-specific language engineering that is still an ongoing research project in my former research group. WebDSL's purpose is to be a domain-specific language for developing dynamic web applications with a rich data model.

Many aspects in Nix/NixOS were quite "primitive" compared to today's implementations -- there was no NixOS module system, making it less flexible to create additions. Many packages that I needed were missing and I had to write Nix expressions for them myself, such as Apache Tomcat, MySQL, and Midnight Commander. Also the desktop experience, such as KDE, was quite primitive, as only the base package was supported.

As part of my master's thesis project, I did an internship at the Healthcare Systems Architecture group at Philips Research. They had been developing a platform called SDS2, whose purpose was to provide asset tracking and utilization analysis services for medical equipment.

SDS2 qualifies itself as a service-oriented system (a term that people used to talk frequently about in the past, but not anymore :) ). As such, it can be decomposed into a set of distributable components (a.k.a. services) that interact with each other through "standardized protocols" (e.g. SOAP), sometimes through network links.

There are a variety of reasons why SDS2 has a distributed architecture. For example, data that has been gathered from medical devices may have to be physically stored inside a hospital for privacy reasons. The analysis components may require a lot of computing power and would perform better if they run in a data center with a huge amount of system resources.

Being able to distribute services is good for many reasons (e.g. to meet certain non-functional requirements, such as privacy), but it also has a big drawback -- services are software components, and one of their characteristics is that they are units of deployment. Deploying a single service to one machine without any (or proper) automation is already complicated and time consuming, but deploying a network of services to a network of machines is many times as complex.

The goal of my thesis assignment was to automate SDS2's deployment in distributed environments using the Nix package manager as a basis. Nix provides a number of unique properties compared to many conventional deployment solutions, such as fully automated deployment from declarative specifications, and reliable and reproducible deployment. However, it was also lacking a number of features to provide the same or similar kinds of quality properties to deployment processes of service-oriented systems in networks of machines.

The result of my master's thesis project was the first prototype of Disnix that I never officially released. After my internship, I started my PhD research and resumed working on Disnix (as well as several other aspects). This resulted in a second prototype and two official releases eventually turning Disnix into what it is today.

Prototype 1


This was the prototype resulting from my master's thesis and was primarily designed for deploying SDS2.

The first component that I developed was a web service (using similar kinds of technologies as SDS2, such as Apache Tomcat and Apache Axis2) exposing a set of deployment operations to remote machines (most of them consulting the Nix package manager).

To cope with permissions and security, I decided to make the web service just an interface around a "core service" that was responsible for actually executing the deployment activities. The web service used the D-Bus protocol to communicate with the core.

On top of the web service layer, I implemented a collection of tools each executing a specific deployment activity in a network of machines, such as building, distributing and activating services. There were also a number of tools combining deployment activities, such as the "famous" disnix-env command responsible for executing all the activities required to deploy a system.

The first prototype of disnix-env, in contrast to today's implementation, provided two deployment procedure variants: building on targets and building on the coordinator.

The first variant was basically inspired by the manual workflow I used to carry out to get SDS2 deployed -- I manually installed a couple of NixOS machines, used SSH to remotely connect to them, did a checkout of Nixpkgs and all the other Nix expressions that I needed, deployed all packages from source, and finally modified the system configuration (e.g. Apache Tomcat) to run the web services.

Unfortunately, transferring Nix expressions is not an easy process, as they are rarely self-contained and typically rely on other Nix expression files scattered over the file system. While thinking about a solution, I "discovered" that the Nix expression evaluator creates so-called store derivation files (low-level build specifications) for each package build. Store derivations are also stored in the Nix store next to ordinary packages, including their dependencies. I could instead instantiate a Nix expression on the coordinator, transfer the closure of store derivation files to a remote machine, and build them there.

After some discussion with my company supervisor Merijn de Jonge, I learned that compiling on target machines was undesired, in particular in production environments. Then I learned more about Nix's purely functional nature, and "discovered" that builds are referentially transparent -- for example, it should not matter where a build has been performed. As long as the dependencies remain the same, the outcome would be the same as well. With this "new knowledge" in mind, I implemented a second deployment procedure variant that would do the package builds on the coordinator machine, and transfer their closures (dependencies) to the target machines.

As with the current implementation, deployment in Disnix was driven by three kinds of specifications: the services model, infrastructure model and distribution model. However, their notational conventions were a bit different -- the services model already knew about inter-dependencies, but propagating the properties of inter-dependencies to build functions was an ad-hoc process. The distribution model was a list of attribute sets also allowing someone to specify the same mappings multiple times (which resulted in undefined outcomes).

Another primitive aspect was the activation step, such as deploying web applications inside Apache Tomcat. It was basically done by a hardcoded script that only knew about Java web applications and Java command-line tools. Database activation was completely unsupported, and had to be done by hand.

I also did a couple of other interesting things. I studied the "two-phase commit protocol" for upgrading distributed systems atomically and mapped its concepts to Nix operations, to support (almost) atomic upgrades. This idea resulted in a research paper that I have presented at HotSWUp 2008.

Finally, I sketched a simple dynamic deployment extension (and wrote a partial implementation for it) that would calculate a distribution model, but time did not permit me to finish it.

Prototype 2


The first Disnix prototype made me quite happy in the early stages of my PhD research -- I gave many cool demos to various kinds of people, including our industry partner: Philips Healthcare and NWO/Jacquard: the organization that was funding me. However, I soon realized that the first prototype became too limited.

The first annoyance was my reliance on Java. Most of the tools in the Disnix distribution were implemented in Java and depended on the Java Runtime Environment, which is quite a big dependency for a set of command-line utilities. I reengineered most of the Disnix codebase and rewrote it in C. I only kept the core service (which was implemented in C already) and the web service interface, that I separated into an external package called DisnixWebService.

I also got rid of the reliance on a web service to execute remote deployment operations, because it was quite tedious to deploy it. I made the communication aspect pluggable and implemented an SSH plugin that became the default communication protocol (the web service protocol could still be used as an external plugin).

For the activation and deactivation of services, I developed a plugin system (Disnix activation scripts) and a set of modules supporting various kinds of services replacing the hardcoded script. This plugin system allowed me to activate and deactivate many kinds of components, including databases.

Finally, I unified the two deployment procedure variants of disnix-env into one procedure. Building on the targets became simply an optional step that was carried out before building on the coordinator.

Disnix 0.1


After my major reengineering effort, I was looking into publishing something about it. While working on a paper (whose first version got badly rejected), I realized that services in a SOA context are "platform independent" because of their interfaces, but they still have implementations underneath that could depend on many kinds of technologies. This heterogeneity makes deployment extra complicated.

There was still one piece missing to bring service-oriented systems to their full potential -- there was no multiple operating systems support in Disnix. The Nix package manager could also be used on several other operating systems besides Linux, but Disnix was bound to one operating system only (Linux).

I did another major reengineering effort to make the system architecture of the target systems configurable, requiring me to change many things internally. I also developed new notational conventions for the Disnix models. Each service expression became a nested function in which the outer function corresponds to the intra-dependencies and the inner function to the inter-dependencies, making these expressions look quite similar to expressions for ordinary Nix packages. Moreover, I removed the ambiguity problem in the distribution model by making it an attribute set.

The resulting Disnix version was first described in my SEAA 2010 paper. Shortly after the paper got accepted, I decided to officially release this version as Disnix 0.1. Many external aspects of this version are still visible in the current version.

Disnix 0.2


After releasing the first version of Disnix, I realized that there were still a few pieces missing while automating deployment processes of service-oriented systems. One of the limitations of Disnix is that it expects machines to already be present, possibly running a number of preinstalled system services, such as MySQL, Apache Tomcat, and the Disnix service exposing remote deployment operations. These machines had to be deployed by other means first.

Together with Eelco Dolstra I had been working on declarative deployment and testing of networked NixOS configurations, resulting in a tool called nixos-deploy-network that deploys networks of NixOS machines and a NixOS test driver capable of spawning networks of NixOS virtual machines in which system integration tests can be run non-interactively. These contributions were documented in a tech report and the ISSRE 2010 paper.

I made Disnix more modular so that extensions could be built on top of it. The most prominent extension was DisnixOS, which integrates NixOS deployment and the NixOS test driver's features with Disnix service deployment so that a service-oriented system's deployment process could be fully automated.

Another extension was Dynamic Disnix, a continuation of the dynamic deployment extension that I never finished during my internship. Dynamic Disnix extends the basic toolset with an infrastructure discovery tool and a distribution generator using deployment planning algorithms from the academic literature to map services to machines. The extended architecture is described in the SEAMS 2011 paper.

The revised Disnix architecture has been documented in both the WASDeTT 2010 and SCP 2014 papers and was released as Disnix 0.2.

Disnix 0.3


After the 0.2 release I got really busy, which was partly caused by the fact that I had to write my PhD thesis and yet another research paper for an unfinished chapter.

The last Disnix-related research contribution was a tool called Dysnomia, which I had based on the Disnix activation scripts package. I augmented the plugins with experimental state deployment operations and changed the package into a new tool that (in theory) could be combined with other tools as well, or used independently.

Unfortunately, I had to quickly rush out a paper for HotSWUp 2012 and the code was in a barely usable state. Moreover, the state management facilities had some huge drawbacks, so I was not that eager to get them integrated into the mainstream version.

Then I had to fully dedicate myself to completing my PhD thesis and for more than six months, I hardly wrote any code.

After finishing the first draft of my PhD thesis and while waiting for feedback from my committee, I left academia and switched jobs. Because I had no practical use cases for Disnix, and other duties in my new job, its development was done mostly in my spare time at a very low pace -- one thing that I accomplished in that period was creating a 'slim' version of Dysnomia that supported all the activities in the HotSWUp paper without any snapshotting facilities.

Meanwhile, nixos-deploy-network got replaced by a new tool named Charon, which later became NixOps. In addition to deployment, NixOps could also instantiate virtual machines in IaaS environments, such as Amazon EC2. I modified DisnixOS to also integrate with NixOps so that its capabilities could be used.

Three and a half years after the previous release (late 2014), my new employer wanted to deploy their new microservices-based system to a production environment, which made me quite motivated to work on Disnix again. I did some huge refactorings and optimized a few aspects to make it work for larger systems. Some interesting optimizations were concurrent data transfers and concurrent service activations.

I also implemented multi-connection protocol support. For example, you could use SSH to connect to one machine and SOAP to another.

After implementing the optimizations, I realized that I had reached a stable point and decided that it was a good time to announce the next release, after a few years of only little development activity.

Disnix 0.4


Despite being happy with the recent Disnix 0.3 release and using it to deploy many services to production environments, I quickly ran into another problem -- the services that I had to manage store data in their own dedicated databases. Sometimes I had to move services from one machine to another. Disnix (like the other Nix tools) does not manage state, requiring me to manually migrate data, which was quite painful.

I decided to dig up the state deployment facilities from the HotSWUp 2012 paper to cope with this problem. Despite having a number of limitations, the databases that I had to manage were relatively small (tens of megabytes), so the solution was still a good fit.

I integrated the state management facilities described in the paper from the prototype into the "production" version of Dysnomia, and modified Disnix to use them. I left out the incremental snapshot facilities described in the paper, because there was no practical use for them. When the work was done, I announced the next release.

Disnix 0.5


With Disnix 0.4, all my configuration management work was automated. However, I spotted a couple of inefficiencies, such as many unnecessary redeployments while upgrading. I solved this issue by making the target-specific services concept a first class citizen in Disnix. Moreover, I regularly had to deal with RAM issues and added on-demand activation support (by using the operating system's service manager, such as systemd).

There were also some user-unfriendly aspects that I improved -- better and more concise logging, more helpful error messages, --rollback, --switch-generation options for disnix-env, and some commands that work on the deployment manifest were extended to take the last deployed manifest into account when no parameters have been provided (e.g. disnix-visualize).

Conclusion


This long blog post describes how the current Disnix version (0.5) came about after nearly eight years of development. I'd like to announce its immediate availability! Consult the Disnix homepage for more information.

Friday, January 22, 2016

Integrating callback and promise based function invocation patterns (Asynchronous programming with JavaScript part 4)

It has been quiet for a while on my blog in the programming language domain. Over two years ago, I started writing a series of blog posts about asynchronous programming with JavaScript.

In the first blog post, I explained some general asynchronous programming issues, code structuring issues and briefly demonstrated how the async library can be used to structure code more properly. Later, I have written a blog post about promises, another abstraction mechanism dealing with asynchronous programming complexities. Finally, I have developed my own abstraction functions by investigating how JavaScript's structured programming language constructs (that are synchronous) translate to the asynchronous programming world.

In these blog posts, I have used two kinds of function invocation styles -- something that I call the Node.js-function invocation style, and the promises invocation style. As the name implies, the former is used by the Node.js standard library, as well as many Node.js-based APIs. The latter is getting more common in the browser world. As a matter of fact, many modern browsers provide a Promise prototype as part of their DOM API, allowing others to construct their own Promise-based APIs with it.

In this blog post, I will compare both function invocation styles and describe some of their differences. Additionally, there are situations in which I have to mix APIs using both styles and I have observed that it is quite annoying to combine them. I will show how to alleviate this pain a bit by developing my own generically applicable adapter functions.

Two example invocations


The most frequently used invocation style in my blog posts is something that I call the Node.js-function invocation style. An example code fragment that uses such an invocation is the following:

fs.readFile("hello.txt", function(err, data) {
    if(err) {
        console.log("Error while opening file: "+err);
    } else {
        console.log("File contents is: "+data);
    }
});

As you may see in the code fragment above, when we invoke the readFile() function, it returns immediately (to be precise: it returns, but it returns no value). We use a callback function (that is typically the last function parameter) to retrieve the results of the invocation (or the error if something went wrong) at a later point in time.

By convention, the first parameter of the callback is an error parameter that is not null if some error occurs. The remaining parameters are optional and can be used to retrieve the corresponding results.

When using promises (more specifically: promises that conform to the Promises/A and Promises/A+ specifications), we use a different invocation pattern that may look as follows:

Task.findAll().then(function(tasks) {
    for(var i = 0; i < tasks.length; i++) {
        var task = tasks[i];
        console.log(task.title + ": "+ task.description);
    }
}, function(err) {
    console.log("An error occured: "+err);
});

As with the previous example, the findAll() function invocation shown above also returns immediately. However, it also does something different compared to the Node.js-style function invocation -- it returns an object called a promise whereas the invocation in the previous example never returns anything.

By convention, the resulting promise object provides a method called then() in which (according to the Promises/A and A+ standards) the first parameter is a callback that gets invoked when the function invocation succeeds, and the second callback gets invoked when the function invocation fails. The parameters of the callback functions represent result objects or error objects.

Comparing the invocation styles


At first sight, you may probably notice that despite having different styles, both function invocations return immediately and need an "artificial facility" to retrieve the corresponding results (or errors) at a later point in time, as opposed to directly returning a result in a function.

The major difference is that in the promises invocation style, you will always get a promise as a result of an invocation. A promise provides a reference to something whose corresponding result will be delivered in the future. For example, when running:

var tasks = Task.findAll();

I will obtain a promise that, at some point in the future, provides me an array of tasks. I can use this reference to do other things by passing the promise around (for example) as a function argument to other functions.

For example, I may want to construct a UI displaying the list of tasks. I can already construct pieces of it without waiting for the full list of tasks to be retrieved:

displayTasks(tasks);

The above function could, for example, already start rendering a header, some table cells and buttons without the results being available yet. The display function invokes the then() function when it really needs the data.
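
To make this idea a bit more concrete, the following minimal sketch shows what such a display function could look like. The rendering helpers (renderHeader(), renderTaskRow(), renderError()) are made up for the example and simply write to the console, and Task.findAll() refers to the promise-based invocation shown earlier:

/* Hypothetical rendering helpers -- stubbed with console output for this example */
function renderHeader() { console.log("My tasks:"); }
function renderTaskRow(task) { console.log(task.title + ": " + task.description); }
function renderError(err) { console.log("An error occurred: " + err); }

function displayTasks(tasksPromise) {
    renderHeader(); /* can already run before any task data is available */

    /* Only when the data is really needed do we wait for the promise */
    tasksPromise.then(function(tasks) {
        tasks.forEach(renderTaskRow);
    }, function(err) {
        renderError(err);
    });
}

/* The pending promise is passed around like an ordinary value */
displayTasks(Task.findAll());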

By contrast, in the Node.js-callback style, I have no reference to the pending invocation at all. This means that I always have to wait for its completion before I can render anything UI related. Because we are forced to wait for its completion, it will probably make the application quite unresponsive, in particular when we have to retrieve many task records.

So in general, in addition to better structured code, promises support composability whereas Node.js-style callbacks do not. Because of this reason, I consider promises to be more powerful.

However, there is also something that I consider a disadvantage. In my first blog post, I have shown the following Node.js-function invocation style pyramid code example as a result of nesting callbacks:

var fs = require('fs');
var path = require('path');

fs.mkdir("out", 0755, function(err) {
    if(err) throw err;
    
    fs.mkdir(path.join("out, "test"), 0755, function(err) {
        if (err) throw err;        
        var filename = path.join("out", "test", "hello.txt");

        fs.writeFile(filename, "Hello world!", function(err) {
            if(err) throw err;
                    
            fs.readFile(filename, function(err, data) {
                if(err) throw err;
                
                if(data == "Hello world!")
                    process.stderr.write("File is correct!\n");
                else
                    process.stderr.write("File is incorrect!\n");
            });
        });
    });
});

I have also shown in the same blog post that I can use the async.waterfall() abstraction to flatten its structure:

var fs = require('fs');
var path = require('path');
var async = require('async');

var filename = path.join("out", "test", "hello.txt");

async.waterfall([
    function(callback) {
        fs.mkdir("out", 0755, callback);
    },

    function(callback) {
        fs.mkdir(path.join("out, "test"), 0755, callback);
    },

    function(callback) {
        fs.writeFile(filename, "Hello world!", callback);
    },

    function(callback) {
        fs.readFile(filename, callback);
    },

    function(data, callback) {
        if(data == "Hello world!")
            process.stderr.write("File is correct!\n");
        else
            process.stderr.write("File is incorrect!\n");

        callback(); /* signal completion so that the final waterfall callback runs */
    }

], function(err, result) {
    if(err) throw err;
});

As you may probably notice, the code fragment above is much more readable and easier to maintain.

In my second blog post, I implemented a promises-based variant of the same example:

var fs = require('fs');
var path = require('path');
var Promise = require('rsvp').Promise;

/* Promise object definitions */

var mkdir = function(dirname) {
    return new Promise(function(resolve, reject) {
        fs.mkdir(dirname, 0755, function(err) {
            if(err) reject(err);
            else resolve();
        });
    });
};

var writeHelloTxt = function(filename) {
    return new Promise(function(resolve, reject) {
        fs.writeFile(filename, "Hello world!", function(err) {
            if(err) reject(err);
            else resolve();
        });
    });
};

var readHelloTxt = function(filename) {
    return new Promise(function(resolve, reject) {
        fs.readFile(filename, function(err, data) {
            if(err) reject(err);
            else resolve(data);
        });
    });
};

/* Promise execution chain */

var filename = path.join("out", "test", "hello.txt");

mkdir(path.join("out"))
.then(function() {
    return mkdir(path.join("out", "test"));
})
.then(function() {
    return writeHelloTxt(filename);
})
.then(function() {
    return readHelloTxt(filename);
})
.then(function(data) {
    if(data == "Hello world!")
        process.stderr.write("File is correct!\n");
    else
        process.stderr.write("File is incorrect!\n");
}, function(err) {
    console.log("An error occured: "+err);
});

As you may notice, because the then() function invocations can be chained, we also have a flat structure, making the code easier to maintain. However, the code fragment is also considerably longer than the async library variant and the unstructured variant -- for each asynchronous function invocation, we must construct a promise object, adding quite a bit of overhead to the code.

From my perspective, if you need to do many ad-hoc steps (and not having to compose complex things), callbacks are probably more convenient. For reusable operations, promises are typically a nicer solution.

Mixing function invocations from both styles


It may happen that function invocations from both styles need to be mixed. Typically mixing is imposed by third-party APIs -- for example, when developing a Node.js web application we may want to use express.js (callback based) for implementing a web application interface in combination with sequelize (promises based) for accessing a relational database.

Of course, you could write a function constructing promises that internally only uses Node.js-style invocations, or the opposite. But if you have to regularly intermix calls, you may end up writing a lot of boilerplate code. For example, if I would use the async.waterfall() abstraction in combination with promise-style function invocations, I may end up writing:

async.waterfall([
    function(callback) {
        Task.sync().then(function() {
            callback();
        }, function(err) {
            callback(err);
        });
    },
    
    function(callback) {
        Task.create({
            title: "Get some coffee",
            description: "Get some coffee ASAP"
        }).then(function() {
            callback();
        }, function(err) {
            callback(err);
        });
    },
    
    function(callback) {
        Task.create({
            title: "Drink coffee",
            description: "Because I need caffeine"
        }).then(function() {
            callback();
        }, function(err) {
            callback(err);
        });
    },
    
    function(callback) {
        Task.findAll().then(function(tasks) {
            callback(null, tasks);
        }, function(err) {
            callback(err);
        });
    },
    
    function(tasks, callback) {
        for(var i = 0; i < tasks.length; i++) {
            var task = tasks[i];
            console.log(task.title + ": "+ task.description);
        }

        callback(); /* signal completion so that the final callback can exit the process */
    }
], function(err) {
    if(err) {
        console.log("An error occurred: "+err);
        process.exit(1);
    } else {
        process.exit(0);
    }
});

For each Promise-based function invocation, I need to invoke the then() function and, in the corresponding callbacks, I must invoke the callback of each function block to propagate the results or the error. This makes the code unnecessarily long, tedious to write and a pain to maintain.

Fortunately, I can create a function that abstracts over this pattern:

function chainCallback(promise, callback) {
    promise.then(function() {
        var args = Array.prototype.slice.call(arguments, 0);
        
        args.unshift(null);
        callback.apply(null, args);
    }, function() {
        var args = Array.prototype.slice.call(arguments, 0);
        
        if(args.length == 0) {
            callback("Promise error");
        } else if(args.length == 1) {
            callback(args[0]);
        } else {
            callback(args);
        }
    });
}

The above code fragment does the following:

  • We define a function that takes a promise and a Node.js-style callback function as parameters and invokes the then() method of the promise.
  • When the promise has been fulfilled, it sets the error parameter of the callback to null (to indicate that there is no error) and propagates all resulting objects as remaining parameters to the callback.
  • When the promise has been rejected, we propagate the resulting error object. Because the Node.js-style-callback requires a single defined object, we compose one ourselves if no error object was returned, and we return an array as an error object, if multiple error objects were returned.

Using this abstraction function, we can rewrite the earlier pattern as follows:

async.waterfall([
    function(callback) {
        prom2cb.chainCallback(Task.sync(), callback);
    },
    
    function(callback) {
        prom2cb.chainCallback(Task.create({
            title: "Get some coffee",
            description: "Get some coffee ASAP"
        }), callback);
    },
    
    function(callback) {
        prom2cb.chainCallback(Task.create({
            title: "Drink coffee",
            description: "Because I need caffeine"
        }), callback);
    },
    
    function(callback) {
        prom2cb.chainCallback(Task.findAll(), callback);
    },
    
    function(tasks, callback) {
        for(var i = 0; i < tasks.length; i++) {
            var task = tasks[i];
            console.log(task.title + ": "+ task.description);
        }

        callback(); /* signal completion so that the final callback can exit the process */
    }
], function(err) {
    if(err) {
        console.log("An error occurred: "+err);
        process.exit(1);
    } else {
        process.exit(0);
    }
});

As may be observed, this code fragment is more concise and significantly shorter.

The opposite mixing pattern also leads to issues. For example, we can first retrieve the list of tasks from the database (through a promise-style invocation) and then write it as a JSON file to disk (through a Node.js-style invocation):

Task.findAll().then(function(tasks) {
    fs.writeFile("tasks.txt", JSON.stringify(tasks), function(err) {
        if(err) {
            console.log("error: "+err);
        } else {
            console.log("everything is OK");
        }
    });
}, function(err) {
    console.log("error: "+err);
});

The biggest annoyance is that we are forced to do the successive step (writing the file) inside the callback function, forcing us to write pyramid code that is harder to read and tedious to maintain. This is because we can only "chain" a promise onto another promise.

Fortunately, we can create a function abstraction that wraps any Node.js-style function into an adapter function that takes the same parameters (without the callback) and returns a promise:

function promisify(Promise, fun) {
    return function() {
        var args = Array.prototype.slice.call(arguments, 0);

        return new Promise(function(resolve, reject) {
            function callback() {
                var args = Array.prototype.slice.call(arguments, 0);
                var err = args[0];
                args.shift();

                if(err) {
                    reject(err);
                } else {
                    resolve(args);
                }
            }

            args.push(callback);

            fun.apply(null, args);
        });
    };
}

In the above code fragment, we do the following:

  • We define a function that takes two parameters: a Promise prototype that can be used to construct promises, and a function representing any Node.js-style function (whose last parameter is a Node.js-style callback).
  • In the function, we construct (and return) a wrapper function that returns a promise.
  • We construct an adapter callback function, that invokes the Promise toolkit's reject() function in case of an error (with the corresponding error object provided by the callback), and resolve() in case of success. In case of success, it simply propagates any result object provided by the Node.js-style callback.
  • Finally, we invoke the Node.js-function with the given function parameters and our adapter callback.

With this function abstraction we can rewrite the earlier example as follows:

Task.findAll().then(function(tasks) {
    return prom2cb.promisify(Promise, fs.writeFile)("tasks.txt", JSON.stringify(tasks));
})
.then(function() {
    console.log("everything is OK");
}, function(err) {
    console.log("error: "+err);
});

As may be observed, we can convert the writeFile() Node.js-style function invocation into an invocation returning a promise, and nicely structure the find and write file invocations by chaining then() invocations.

Conclusions


In this blog post, I have explored two kinds of asynchronous function invocation patterns: Node.js-style and promise-style. You may probably wonder which one I like the most?

I actually hate them both, but I consider promises to be the more powerful of the two because of their composability. However, this comes at a price of doing some extra work to construct them. The most ideal solution to me is still a facility that is part of the language, instead of "forgetting" about existing language constructs and replacing them by custom-made abstractions.

I have also explained that we may have to combine both patterns, which is often quite tedious. Fortunately, we can create function abstractions that convert one into another to ease the pain.

Related work


I am not the first one comparing the function invocation patterns described in this blog post. Parts of this blog post are inspired by a blog post titled: "Callbacks are imperative, promises are functional: Node’s biggest missed opportunity". In this blog post, a comparison between the two invocation styles is done from a programming language paradigm perspective, and is IMO quite interesting to read.

I am also not the first to implement conversion functions between these two styles. For example, promises constructed with the bluebird library implement a method called .asCallback() allowing a user to chain a Node.js-style callback to a promise. Similarly, it provides a function: Promise.promisify() to wrap a Node.js-style function into a function returning a promise.
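
For illustration, the following minimal sketch shows how these bluebird facilities can be used (assuming bluebird has been installed and a file hello.txt exists):

var Promise = require('bluebird');
var fs = require('fs');

/* Wrap a Node.js-style function into a function returning a bluebird promise */
var readFileAsync = Promise.promisify(fs.readFile);

readFileAsync("hello.txt")
    .then(function(data) {
        console.log("File contents: " + data);
        return data;
    })
    /* Chain a Node.js-style callback onto the resulting promise */
    .asCallback(function(err, data) {
        if(err) console.log("error: " + err);
    });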

However, the downside of bluebird is that these facilities can only be used if bluebird is used as a toolkit in an API. Some APIs use different toolkits or construct promises themselves. As explained earlier, Promises/A and Promises/A+ are just interface specifications and only the purpose of then() is defined, whereas the other facilities are extensions.

My function abstractions only make a few assumptions and should work with many implementations. Basically it only requires a proper .then() method (which should be obvious) and a new Promise(function(resolve, reject) { ... }) constructor.

Besides the two function invocation styles covered in this blog post, there are others as well. For example, Zef's blog post titled: "Callback-Free Harmonious Node.js" covers a mechanism called 'Thunks'. In this pattern, an asynchronous function returns a function, which can be invoked to retrieve the corresponding error or result at a later point in time.

References


The two conversion abstractions described in this blog post are part of a package called prom2cb. It can be obtained from my GitHub page and the NPM registry.