pkglink

Space saving Node.js package hard linker.

pkglink locates common JavaScript/Node.js packages from your node_modules directories and hard links the package files so they share disk space.


Why?

As an instructor, I create lots of JavaScript and Node.js projects, and many of them use the same packages. However, because of the way packages are installed, each copy takes up its own disk space. It would be nice to have a way for installations of the same package to share disk space.

Modern operating systems and disk formats support hard links, a way to have one copy of a file on disk that can be reached from multiple paths. Since packages are generally read-only once they are installed, hard linking their files can save a great deal of disk space.

pkglink is a command line tool that searches the directory trees you specify for packages in your node_modules directories. When it finds matching packages of the same name and version that could share space, it hard links the files. As a safety precaution, it checks many file attributes before considering files for linking (see full details later in this doc).

pkglink keeps track of packages it has seen on previous scans, so when you run it on new directories in the future, it can quickly know where to look for previous package matches. It double checks that the previous packages still have the expected version, inode, and modified time before linking, which avoids a full tree scan every time you add a new project. Simply run pkglink once on your project tree and then again on new projects as you create them.
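
The double check described above could look roughly like the following sketch. The record shape (`dir`, `name`, `version`, `ino`, `mtimeMs`) and the choice to stat `package.json` are assumptions made for illustration; this is not pkglink's actual refs format or code.

```javascript
const fs = require('fs');
const path = require('path');

// `ref` is a hypothetical record of a previously seen package:
// { dir, name, version, ino, mtimeMs }
// Returns true only if the package on disk still matches the record.
function isRefStillValid(ref) {
  const pkgJsonPath = path.join(ref.dir, 'package.json');
  let stat;
  let pkg;
  try {
    stat = fs.statSync(pkgJsonPath);
    pkg = JSON.parse(fs.readFileSync(pkgJsonPath, 'utf8'));
  } catch (err) {
    return false; // package was removed or is unreadable
  }
  return (
    pkg.name === ref.name &&
    pkg.version === ref.version &&
    stat.ino === ref.ino &&
    stat.mtimeMs === ref.mtimeMs
  );
}
```

A stale record simply fails the check and is skipped, so an out-of-date refs file costs a lookup rather than a bad link.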

pkglink has been tested on Ubuntu, Mac OS X, and Windows. Hard links are supported on most modern disk formats with the exception of FAT and ReFS.

How much savings?

It all depends on how many matching packages you have on your system, but you will probably be surprised.

After running pkglink on my project directories, it found 128K packages and saved over 20GB of disk space.

Assumptions for use

The main assumption that enables hard linking is that you are not manually modifying your packages after install from the registry. This means that installed packages of the same name and version should generally be the same. Additional checks at the file level are used to verify matches (see filter criteria later in this doc) before selecting them for linking.

Before running any tool that can modify your file system it is always a good idea to have a current backup and sync code with your repositories.

Hard linking will not work on FAT and ReFS file systems. Hard links can only be made between files on the same device (drive). pkglink has been tested on Mac OS X (HFS+), Ubuntu (ext4), and Windows (NTFS).

If you had to recover from an unforeseen defect in pkglink, the recovery process is to simply delete your project's node_modules directory and perform npm install again.

Installation

npm install -g pkglink

Quick start

To find and hard link matching packages

To hard link packages just run pkglink with one or more directory trees that you wish it to scan and link.

pkglink DIR1 DIR2 ...

You will get output similar to this:

jeffbski-laptop:~$ pkglink ~/projects ~/working

pkgs: 128,383 saved: 5.11GB

The run above indicates that pkglink found 128K packages and saved over 5GB of disk space by linking them. (Actual savings were higher because I had previously run pkglink on a portion of the tree.)

Dryrun - just output a list of matching packages

If you wish to see what packages pkglink would link you can use the --dryrun or -d option. pkglink will output matching packages that it would normally link but it will NOT perform any linking.

pkglink -d DIR1 DIR2 ...

The --dryrun output looks like:

jeffbski-laptop:~$ pkglink -d ~/working/expect-test

tmatch-2.0.1
  /Users/jeff/projects/pkglink/fixtures/projects/foo1/node_modules/tmatch
  /Users/jeff/working/expect-test/node_modules/tmatch

object.entries-1.0.3
  /Users/jeff/projects/pkglink/fixtures/projects/foo1/node_modules/object.entries
  /Users/jeff/working/expect-test/node_modules/object.entries

object-keys-1.0.11
  /Users/jeff/projects/pkglink/fixtures/projects/foo1/node_modules/object-keys
  /Users/jeff/working/expect-test/node_modules/object-keys

# pkgs: 21 would save: 3.88MB

Generate link commands only

If you want to see exactly what it would link, down to the file level, you can use the --gen-ln-cmds or -g option, and it will output the equivalent bash commands for the hard links that it would normally create. It will not perform the linking. You can review the output for correctness, or save it to a file and execute it with bash instead of running pkglink again without the -g option.

pkglink -g DIR1 DIR2 ...

The --gen-ln-cmds output looks like:

jeffbski-laptop:~$ pkglink -g ~/working/expect-test

ln -f "/Users/jeff/projects/pkglink/fixtures/projects/foo1/node_modules/define-properties/index.js" "/Users/jeff/working/expect-test/node_modules/define-properties/index.js"
ln -f "/Users/jeff/projects/pkglink/fixtures/projects/foo1/node_modules/expect/CHANGES.md" "/Users/jeff/working/expect-test/node_modules/expect/CHANGES.md"
ln -f "/Users/jeff/projects/pkglink/fixtures/projects/foo1/node_modules/expect/LICENSE.md" "/Users/jeff/working/expect-test/node_modules/expect/LICENSE.md"
ln -f "/Users/jeff/projects/pkglink/fixtures/projects/foo1/node_modules/es-abstract/Makefile" "/Users/jeff/working/expect-test/node_modules/es-abstract/Makefile"
# pkgs: 21 would save: 3.88MB
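
As a rough illustration of how lines like these can be produced, here is a minimal sketch. The helper names `bashQuote` and `genLnCmd`, the escaping rules, and the `/a` and `/b` paths are assumptions made for this example, not pkglink's own code.

```javascript
// Quote a path for use inside a double-quoted bash string by
// escaping the characters bash treats specially there: \ " $ and `
function bashQuote(p) {
  return '"' + p.replace(/(["\\$`])/g, '\\$1') + '"';
}

// Build one `ln -f` line that would replace `dst` with a hard link to `src`.
function genLnCmd(src, dst) {
  return `ln -f ${bashQuote(src)} ${bashQuote(dst)}`;
}

console.log(genLnCmd('/a/node_modules/expect/LICENSE.md', '/b/node_modules/expect/LICENSE.md'));
// ln -f "/a/node_modules/expect/LICENSE.md" "/b/node_modules/expect/LICENSE.md"
```

The -f flag matters because the destination file already exists; without it, ln would refuse to overwrite the duplicate copy.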

Full Usage

Usage: pkglink {OPTIONS} [dir] [dirN]

Description:

     pkglink - Space saving Node.js package hard linker

     pkglink recursively searches directories for Node.js packages
     installed in node_modules directories. It uses the package name
     and version to match up possible packages to share. Once it finds
     similar packages, pkglink walks through the package directory tree
     checking for files that can be linked. If each file's modified
     datetime and size match, it will create a hard link for that file
     to save disk space. (On win32, mtimes are inconsistent and ignored)

     It keeps track of modules linked in ~/.pkglink_refs to quickly
     locate similar modules on future runs. The refs are always
     double checked before being considered for linking. This makes
     it convenient to perform future pkglink runs on new directories
     without having to reprocess the old.

Standard Options:

 -c, --config CONFIG_PATH

  This option overrides the config file path, default ~/.pkglink

 -d, --dryrun

  Instead of performing the linking, just display the modules that
  would be linked and the amount of disk space that would be saved.

 -g, --gen-ln-cmds

  Instead of performing the linking, just generate link commands
  that the system would perform and output

 -h, --help

  Show this message

 -m, --memory MEMORY_MB

  Run with increased or decreased memory specified in MB, overrides
  environment variable PKGLINK_NODE_OPTIONS and config.memory
  The default memory used is 2560.

 -p, --prune

  Prune the refs file by checking all of the refs clearing out any
  that have changed

 -r, --refs-file REFS_FILE_PATH

  Specify where to load and store the link refs file which is used to
  quickly locate previously linked modules. Default ~/.pkglink_refs

 -t, --tree-depth N

  Maximum depth to search the directories specified for packages
  Default depth: 0 (unlimited)

 -v, --verbose

  Output additional information helpful for debugging

If your machine has less than 2.5GB of memory you can use pkglink_low instead of pkglink and it will run with the normal 1.5GB memory default.

Config

The default config file path is ~/.pkglink unless you override it with the --config command line option. If this file exists it should be a JSON file with an object having any of the following properties.

  • refsFile - location of the JSON file used to track the last 5 references to each package it finds, default: ~/.pkglink_refs. This can also be overridden with the --refs-file command line argument.

  • concurrentOps - the number of concurrent operations allowed for IO operations, default: 4

  • consoleWidth - the number of columns in your console, default: 70

  • ignoreModTime - ignore the modification time of the files, default is true on Windows, otherwise false

  • memory - adjust the memory used in MB, default: 2560 (2.5GB). Can also be overridden by setting environment variable PKGLINK_NODE_OPTIONS=--max-old-space-size=1234 or by using the command line argument --memory.

  • minFileSize - the minimum size file to consider for linking in bytes, default: 0

  • refSize - number of package refs to keep in the refsFile which is used to find matching packages on successive runs, default: 5

  • tree-depth - the maximum depth to search the directories for packages, default: 0 (unlimited). Can also be overridden with --tree-depth command line option.
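
To show how a partial config file combines with the documented defaults, here is a minimal sketch. The function name `loadConfig` and the merge strategy are assumptions for illustration; only the property names and default values come from the list above.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Defaults as documented above. The object itself is an illustration,
// not pkglink's internal representation.
const DEFAULTS = {
  refsFile: path.join(os.homedir(), '.pkglink_refs'),
  concurrentOps: 4,
  consoleWidth: 70,
  ignoreModTime: process.platform === 'win32',
  memory: 2560,
  minFileSize: 0,
  refSize: 5,
  'tree-depth': 0,
};

// Read the JSON config if it exists and let its properties override defaults.
function loadConfig(configPath = path.join(os.homedir(), '.pkglink')) {
  let fromFile = {};
  if (fs.existsSync(configPath)) {
    fromFile = JSON.parse(fs.readFileSync(configPath, 'utf8'));
  }
  return { ...DEFAULTS, ...fromFile };
}
```

With this approach a config file containing only `{ "minFileSize": 1024 }` changes that one setting and leaves every other value at its default.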

How do I know it is working?

If you check your disk space before and after a run, the savings should be at least as large as pkglink reports. pkglink reports the total size of the files it linked, but the actual savings can be greater because disks allocate space in whole blocks.
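
To see why the on-disk savings can exceed the reported figure, consider this small sketch. It assumes a file system that allocates space in 4096-byte blocks (a common value, but an assumption; yours may differ), and the file sizes are just example numbers.

```javascript
// Bytes a file occupies on disk when space is allocated in whole blocks.
function allocatedBytes(fileSize, blockSize = 4096) {
  return Math.ceil(fileSize / blockSize) * blockSize;
}

// Example sizes, in bytes, of four small files from one package.
const sizes = [972, 1080, 2725, 1560];
const reported = sizes.reduce((sum, s) => sum + s, 0);
const onDisk = sizes.reduce((sum, s) => sum + allocatedBytes(s), 0);

console.log({ reported, onDisk });
// { reported: 6337, onDisk: 16384 }
```

Packages tend to contain many small files, so the gap between the summed file sizes and the blocks actually freed can be substantial.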

On systems with bash, you can also use ls -ali node_modules/XYZ to see the number of hard links a particular file has (which is the number of times it is shared) and the actual inode values.

When using the -i option with ls, the first column is the inode of the file, so you can compare one directory's files with another's. The third column is the number of hard links, so in the listing below you can see that CHANGELOG.md, LICENSE, README.md, and index.js each have 17 hard links.

jeffbski-laptop:~/working/expect-test$ ls -ali node_modules/define-properties/
total 80
89543426 drwxr-xr-x  13 jeff  staff   442 Oct 22 04:02 .
89543425 drwxr-xr-x  24 jeff  staff   816 Oct 22 03:58 ..
89543473 -rw-r--r--   1 jeff  staff   276 Oct 14  2015 .editorconfig
89543474 -rw-r--r--   1 jeff  staff   156 Oct 14  2015 .eslintrc
89543475 -rw-r--r--   1 jeff  staff  3062 Oct 14  2015 .jscs.json
89543476 -rw-r--r--   1 jeff  staff     8 Oct 14  2015 .npmignore
89543477 -rw-r--r--   1 jeff  staff  1182 Oct 14  2015 .travis.yml
89212049 -rw-r--r--  17 jeff  staff   972 Oct 14  2015 CHANGELOG.md
89212004 -rw-r--r--  17 jeff  staff  1080 Oct 14  2015 LICENSE
89211984 -rw-r--r--  17 jeff  staff  2725 Oct 14  2015 README.md
89212027 -rw-r--r--  17 jeff  staff  1560 Oct 14  2015 index.js
89543482 -rw-r--r--   1 jeff  staff  1593 Oct 14  2015 package.json
89543447 drwxr-xr-x   3 jeff  staff   102 Oct 22 04:02 test
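
On systems without ls, or if you prefer to script the check, the same information is available from Node.js. This is a small sketch using the standard fs.statSync call; the function name `linkInfo` is made up for this example.

```javascript
const fs = require('fs');

// Report whether two paths point at the same file on disk,
// along with how many paths share that file.
function linkInfo(pathA, pathB) {
  const a = fs.statSync(pathA);
  const b = fs.statSync(pathB);
  return {
    hardLinked: a.dev === b.dev && a.ino === b.ino,
    links: a.nlink,
  };
}
```

Calling `linkInfo` with the index.js of the same package in two different projects should return `hardLinked: true` once pkglink has linked them.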

What files will it link in the packages?

pkglink looks for packages in the node_modules directories of the directory trees that you specify as args on the command line.

To be considered for linking the following criteria are checked:

  • package name and version from package.json must match
  • package.json is excluded from linking since npm often modifies it on install
  • files are on the same device (drive) - hard links only work on same device
  • files are not already the same inode (not already hard linked)
  • file size is the same
  • file modified time is the same (except on Windows which doesn't maintain the original modified times during npm installs)
  • file size is >= config.minFileSize (defaults to 0 to include all)
  • directories starting with a . and all their descendants are ignored
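
The file-level criteria above can be sketched as a single predicate. This is only an illustration under stated assumptions: the function name `canLink` and its options object are hypothetical, and the package name/version match and the dot-directory rule are assumed to be handled by the caller while walking the tree.

```javascript
const fs = require('fs');
const path = require('path');

// Decide whether `dstPath` could be replaced by a hard link to `srcPath`.
function canLink(srcPath, dstPath, options = {}) {
  const {
    minFileSize = 0,
    ignoreModTime = process.platform === 'win32',
  } = options;

  if (path.basename(dstPath) === 'package.json') return false; // excluded from linking

  const src = fs.statSync(srcPath);
  const dst = fs.statSync(dstPath);

  if (!src.isFile() || !dst.isFile()) return false;
  if (src.dev !== dst.dev) return false;   // hard links need the same device
  if (src.ino === dst.ino) return false;   // already hard linked
  if (src.size !== dst.size) return false; // sizes must match
  if (src.size < minFileSize) return false;
  if (!ignoreModTime && src.mtimeMs !== dst.mtimeMs) return false;
  return true;
}
```

Note that, like the listed criteria, this sketch never compares file contents; it relies on name, version, size, and modified time agreeing.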

FAQ

Q. Can I run this for a single project?

Yes, pkglink is designed so that you can run it for individual projects or for a whole directory tree. It keeps track of packages it has already seen on previous runs (in its refs file) so it can perform links with those as well as any duplication in your project.

Q. Once I use this do I need to do anything special when deleting or updating projects?

No, since pkglink works by using hard links, your operating system will handle things appropriately under the covers. The OS updates the link count when packages are deleted from a particular path. If you update or reinstall then your packages will simply replace those that were there. You could run pkglink on the project again to hard link the new files.

Also while pkglink keeps a list of packages it has found in its refs file (~/.pkglink_refs), it always double checks packages before using them for linking (and it updates the refs file). You may also run pkglink with the --prune option to check all the refs.
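
The link-count behavior described in this answer can be observed directly. The sketch below uses a temporary directory and made-up file names standing in for the same package file in two projects; it is a demonstration of how the OS treats hard links, not part of pkglink.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Link one file under two paths, delete one path, and observe the other.
function deleteDemo() {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'pkglink-faq-'));
  const projA = path.join(dir, 'projA-index.js');
  const projB = path.join(dir, 'projB-index.js');

  fs.writeFileSync(projA, 'module.exports = 1;\n');
  fs.linkSync(projA, projB); // both paths now share one file

  const linksBefore = fs.statSync(projB).nlink;
  fs.unlinkSync(projA); // like deleting project A
  const linksAfter = fs.statSync(projB).nlink;
  const content = fs.readFileSync(projB, 'utf8'); // project B is unaffected

  fs.rmSync(dir, { recursive: true, force: true });
  return { linksBefore, linksAfter, content };
}

console.log(deleteDemo());
```

The link count drops from 2 to 1 and the remaining path still reads the original content, which is why deleting a project needs no special handling.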

Q. Can I interrupt pkglink during its run?

Yes, press Control-C once and pkglink will cancel its processing and shut down. Please allow time for it to shut down gracefully.

Q. What does the output mean?

jeffbski-laptop:~$ pkglink ~/projects ~/working

pkgs: 128,383 saved: 5.11GB

In this run, pkglink found 128K packages and saved over 5GB of space by linking them. pkglink reports the total file size saved, but the actual savings on disk are likely larger due to drive block sizes. Comparing df -H before and after the run showed that the actual space saved was around 11GB.

Since I had already run pkglink on portions of this tree, this was only the additional savings gained. I had already linked another 8GB previously so my total link savings was closer to 20GB.

If you run pkglink again immediately after this run, it will report the same package count, but the savings reported will be 0 since everything has already been linked.

Q. What do I do if I get an out of memory error?

If you run pkglink on a really large directory tree, you might get an out of memory error during the run.

The error might look something like:

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory

You can either run pkglink on smaller portions of the tree at a time or allow pkglink to use more memory for its run. You can do this by using the --memory or -m option or by changing the memory config option in the ~/.pkglink JSON file.

By default pkglink runs with 2.5GB of memory, so to increase it to 4GB, you could use the following command:

pkglink -m 4096 DIR1 DIR2 ...

If you don't have 2.5GB of memory, you can use the low memory version of pkglink, pkglink_low DIR1 DIR2 ..., and it will run with the Node.js defaults. Note that you may need to run pkglink_low on smaller portions of the directory tree at a time.

Recovering from an unforeseen problem

If you need to recover from a problem the standard way is to simply delete your project's node_modules directory and run npm install again.

If pkglink exits early, failing to give you the summary output or if you get an out of memory error, see the FAQ above about handling out of memory errors. You can run pkglink on smaller directory trees at a time or increase the memory available to it.

License

MIT license

Credits

This project was born out of discussions between @kevinold and @jeffbski at Strange Loop 2016.

CodeWinds Training sponsored the development of this project.