This document describes how to set up all the dependencies needed to run the notebooks in this repository on the following platforms:
- Local (Linux, MacOS or Windows) or DSVM (Linux or Windows)
- Azure Databricks
- Docker container
Table of contents:
- Compute environments
- Setup guide for Local or DSVM
- Setup guide for Azure Databricks
- Install the utilities via PIP
- Setup guide for Docker
Depending on the type of recommender system and the notebook that needs to be run, there are different computational requirements. Currently, this repository supports Python CPU, Python GPU and PySpark.
- A machine running Linux, MacOS or Windows
- Anaconda with Python version >= 3.6
- This is pre-installed on Azure DSVM, so you can run the following steps directly. To set up on your local machine, Miniconda is a quick way to get started.
- Apache Spark (this is only needed for the PySpark environment).
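Before creating an environment, it can help to confirm the prerequisites listed above. The following is a minimal sketch, not part of the repository: the function name `check_prerequisites` is hypothetical, and it assumes that `conda` and `spark-submit` are discoverable on the `PATH`, which may not hold even when they are installed.

```python
import shutil
import sys

MIN_PYTHON = (3, 6)  # minimum Python version stated in the prerequisites


def check_prerequisites():
    """Return a dict describing which local prerequisites look satisfied."""
    return {
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        # shutil.which returns None when the executable is not on PATH
        "conda_found": shutil.which("conda") is not None,
        # Only needed for the PySpark environment; Spark may be installed
        # without spark-submit being on PATH, so treat False as a hint only.
        "spark_submit_found": shutil.which("spark-submit") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {ok}")
```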
As a pre-requisite to install the dependencies with Conda, make sure that Anaconda and the package manager Conda are both up to date:
conda update conda -n root
conda update anaconda # use 'conda install anaconda' if the package is not installed
We provide a script, generate_conda_file.py, that generates a conda environment yaml file. You can use this file to create the target environment with Python 3.6 and all the correct dependencies.
NOTE: the xlearn package has a dependency on cmake. If you use the xlearn related notebooks or scripts, make sure cmake is installed on the system. The easiest way to install it on Linux is with apt-get:
sudo apt-get install -y build-essential cmake
Detailed instructions for installing cmake from source can be found here.
Assuming the repo is cloned as Recommenders in the local system, to install a default (Python CPU) environment:
cd Recommenders
python tools/generate_conda_file.py
conda env create -f reco_base.yaml
You can specify the environment name as well with the flag -n.
The following sections describe how to install the Python GPU and PySpark environments:
Python GPU environment
Assuming that you have a GPU machine, to install the Python GPU environment:
cd Recommenders
python tools/generate_conda_file.py --gpu
conda env create -f reco_gpu.yaml
PySpark environment
To install the PySpark environment:
cd Recommenders
python tools/generate_conda_file.py --pyspark
conda env create -f reco_pyspark.yaml
Additionally, if you want to test a particular version of Spark, you can pass the --pyspark-version argument:
python tools/generate_conda_file.py --pyspark-version 2.4.5
Then, we need to set the environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON to point to the conda python executable.
The following sections give the details for each operating system:
Set PySpark environment variables on Linux or MacOS
To set these variables every time the environment is activated, we can follow the steps of this guide.
First, get the path where the environment reco_pyspark is installed:
RECO_ENV=$(conda env list | grep reco_pyspark | awk '{print $NF}')
mkdir -p $RECO_ENV/etc/conda/activate.d
mkdir -p $RECO_ENV/etc/conda/deactivate.d
You also need to find where Spark is installed and set the SPARK_HOME variable. On the DSVM, SPARK_HOME=/dsvm/tools/spark/current.
Then, create the file $RECO_ENV/etc/conda/activate.d/env_vars.sh and add:
#!/bin/sh
RECO_ENV=$(conda env list | grep reco_pyspark | awk '{print $NF}')
export PYSPARK_PYTHON=$RECO_ENV/bin/python
export PYSPARK_DRIVER_PYTHON=$RECO_ENV/bin/python
export SPARK_HOME=/dsvm/tools/spark/current
This will export the variables every time we do conda activate reco_pyspark. To unset these variables when we deactivate the environment, create the file $RECO_ENV/etc/conda/deactivate.d/env_vars.sh and add:
#!/bin/sh
unset PYSPARK_PYTHON
unset PYSPARK_DRIVER_PYTHON
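The Linux and MacOS steps above can also be scripted. The sketch below is illustrative only: `write_conda_hooks` is a hypothetical helper, and unlike the activation script shown above it writes the environment path directly into the file instead of recomputing it with conda env list. The default for `spark_home` is the DSVM location mentioned in this guide; pass your own path on other machines.

```python
from pathlib import Path

ACTIVATE_TEMPLATE = """#!/bin/sh
export PYSPARK_PYTHON={env}/bin/python
export PYSPARK_DRIVER_PYTHON={env}/bin/python
export SPARK_HOME={spark_home}
"""

DEACTIVATE_SCRIPT = """#!/bin/sh
unset PYSPARK_PYTHON
unset PYSPARK_DRIVER_PYTHON
"""


def write_conda_hooks(env_path, spark_home="/dsvm/tools/spark/current"):
    """Create the activate.d and deactivate.d scripts under env_path."""
    env = Path(env_path)
    activate = env / "etc" / "conda" / "activate.d" / "env_vars.sh"
    deactivate = env / "etc" / "conda" / "deactivate.d" / "env_vars.sh"
    for script in (activate, deactivate):
        # Equivalent to the mkdir -p commands in the manual steps
        script.parent.mkdir(parents=True, exist_ok=True)
    activate.write_text(ACTIVATE_TEMPLATE.format(env=env, spark_home=spark_home))
    deactivate.write_text(DEACTIVATE_SCRIPT)
    return activate, deactivate
```

A call such as `write_conda_hooks(reco_env_path)` would take the path printed by the conda env list command above.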
Set PySpark environment variables on Windows
To set these variables every time the environment is activated, we can follow the steps of this guide.
First, get the path where the environment reco_pyspark is installed:
for /f "delims=" %A in ('conda env list ^| grep reco_pyspark ^| awk "{print $NF}"') do set "RECO_ENV=%A"
Then, create the file %RECO_ENV%\etc\conda\activate.d\env_vars.bat and add:
@echo off
for /f "delims=" %%A in ('conda env list ^| grep reco_pyspark ^| awk "{print $NF}"') do set "RECO_ENV=%%A"
set PYSPARK_PYTHON=%RECO_ENV%\python.exe
set PYSPARK_DRIVER_PYTHON=%RECO_ENV%\python.exe
set SPARK_HOME_BACKUP=%SPARK_HOME%
set SPARK_HOME=
set PYTHONPATH_BACKUP=%PYTHONPATH%
set PYTHONPATH=
This will export the variables every time we do conda activate reco_pyspark.
To unset these variables when we deactivate the environment, create the file %RECO_ENV%\etc\conda\deactivate.d\env_vars.bat and add:
@echo off
set PYSPARK_PYTHON=
set PYSPARK_DRIVER_PYTHON=
set SPARK_HOME=%SPARK_HOME_BACKUP%
set SPARK_HOME_BACKUP=
set PYTHONPATH=%PYTHONPATH_BACKUP%
set PYTHONPATH_BACKUP=
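Once the activation scripts are in place, on any operating system, you may want to confirm that the variables point to the interpreter of the active environment. This is a minimal sketch under that assumption; `pyspark_python_status` is a hypothetical name, and it should be run with the Python interpreter of the activated environment.

```python
import os
import sys


def pyspark_python_status(environ=None):
    """Compare the PySpark variables with the interpreter running this code."""
    environ = os.environ if environ is None else environ
    # realpath resolves symlinks such as bin/python -> bin/python3.x
    current = os.path.normcase(os.path.realpath(sys.executable))
    status = {}
    for var in ("PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON"):
        value = environ.get(var)
        if value is None:
            status[var] = "not set"
        elif os.path.normcase(os.path.realpath(value)) == current:
            status[var] = "matches current interpreter"
        else:
            status[var] = "points elsewhere: " + value
    return status


if __name__ == "__main__":
    for var, state in pyspark_python_status().items():
        print(var, "->", state)
```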
Full (PySpark & Python GPU) environment
With this environment, you can run both PySpark and Python GPU notebooks in this repository. To install the environment:
cd Recommenders
python tools/generate_conda_file.py --gpu --pyspark
conda env create -f reco_full.yaml
Then, we need to set the environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON to point to the conda python executable.
See the PySpark environment setup section for details about how to set up those variables; you will need to change the reco_pyspark string in the commands to reco_full.
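The four environment flavors described in this guide differ only in the flags passed to the generator script and in the resulting yaml file name. The sketch below collects that mapping in one place. `conda_setup_commands` is a hypothetical helper, and it assumes the -n flag is passed to conda env create, which accepts it.

```python
# Environment file names as they appear in the commands in this guide.
ENV_FILES = {
    (False, False): "reco_base.yaml",
    (True, False): "reco_gpu.yaml",
    (False, True): "reco_pyspark.yaml",
    (True, True): "reco_full.yaml",
}


def conda_setup_commands(gpu=False, pyspark=False, name=None):
    """Build the two commands (as argument lists) for one environment flavor."""
    generate = ["python", "tools/generate_conda_file.py"]
    if gpu:
        generate.append("--gpu")
    if pyspark:
        generate.append("--pyspark")
    create = ["conda", "env", "create", "-f", ENV_FILES[(gpu, pyspark)]]
    if name is not None:
        create += ["-n", name]  # assumption: custom name given to conda
    return [generate, create]


if __name__ == "__main__":
    for command in conda_setup_commands(gpu=True, pyspark=True):
        print(" ".join(command))
```

The returned lists can be passed to `subprocess.run(command, check=True)` from inside the Recommenders folder.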
We can register our created conda environment to appear as a kernel in the Jupyter notebooks.
conda activate my_env_name
python -m ipykernel install --user --name my_env_name --display-name "Python (my_env_name)"
If you are using the DSVM, you can connect to JupyterHub by browsing to https://your-vm-ip:8000.
- We found that there can be problems if the Spark version of the machine is not the same as the one in the conda file. You can use the option --pyspark-version to address this issue.
- When running Spark on a single local node it is possible to run out of disk space as temporary files are written to the user's home directory. To avoid this on a DSVM, we attached an additional disk to the DSVM and made modifications to the Spark configuration. This is done by including the following lines in the file at /dsvm/tools/spark/current/conf/spark-env.sh:
SPARK_LOCAL_DIRS="/mnt"
SPARK_WORKER_DIR="/mnt"
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=3600 -Dspark.worker.cleanup.interval=300 -Dspark.storage.cleanupFilesAfterExecutorExit=true"
- Another source of problems is when the variable SPARK_HOME is not set correctly. In the Azure DSVM, SPARK_HOME should be /dsvm/tools/spark/current.
- Java 11 might produce errors when running the notebooks. To change it to Java 8:
sudo apt install openjdk-8-jdk
sudo update-alternatives --config java
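Since an incorrect SPARK_HOME is one of the problems listed above, a quick check before launching a notebook can save time. The following is a sketch only: `check_spark_home` is a hypothetical name, and the test for bin/spark-submit assumes a standard Spark distribution layout.

```python
import os


def check_spark_home(environ=None):
    """Return (ok, message) describing the state of SPARK_HOME."""
    environ = os.environ if environ is None else environ
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        return False, "SPARK_HOME is not set"
    if not os.path.isdir(spark_home):
        return False, "SPARK_HOME does not exist: " + spark_home
    # Assumption: a Spark distribution ships bin/spark-submit.
    if not os.path.exists(os.path.join(spark_home, "bin", "spark-submit")):
        return False, "no bin/spark-submit under " + spark_home
    return True, "SPARK_HOME looks valid: " + spark_home


if __name__ == "__main__":
    print(check_spark_home()[1])
```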
- Databricks Runtime version >= 4.3 (Apache Spark 2.3.1, Scala 2.11) and <= 5.5 (Apache Spark 2.4.3, Scala 2.11)
- Python 3
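The runtime range above can be expressed as a simple tuple comparison. This is a minimal sketch with a hypothetical function name; it assumes the version string starts with "major.minor", so a longer identifier that begins that way is also accepted.

```python
MIN_RUNTIME = (4, 3)  # lower bound stated in the requirements
MAX_RUNTIME = (5, 5)  # upper bound stated in the requirements


def is_supported_runtime(version):
    """Return True if a 'major.minor...' runtime version is within the range."""
    # Only the first two dot-separated fields are compared.
    major, minor = (int(part) for part in version.split(".")[:2])
    return MIN_RUNTIME <= (major, minor) <= MAX_RUNTIME


if __name__ == "__main__":
    for candidate in ("4.2", "4.3", "5.5", "6.0"):
        print(candidate, is_supported_runtime(candidate))
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexicographically.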
An example of how to create an Azure Databricks workspace and an Apache Spark cluster within the workspace can be found here. To use deep learning models and GPUs, you can set up a GPU-enabled cluster. For more details about this topic, please see the Azure Databricks deep learning guide.
You can set up the repository as a library on Databricks either manually or by running an installation script. Both options assume that you have access to a provisioned Databricks workspace and cluster, and that you have appropriate permissions to install libraries.
Quick install
This option utilizes an installation script to do the setup, and it requires additional dependencies in the environment used to execute the script.
To run the script, the following prerequisites are required:
Set up CLI authentication for the Azure Databricks CLI (command-line interface). Please find details about how to create a token and set authentication here. Very briefly, you can install and configure your environment with the following commands.
conda activate reco_pyspark
databricks configure --token
Get the target cluster id and start the cluster if its status is TERMINATED.
- You can get the cluster id from the databricks CLI with:
databricks clusters list
- If required, you can start the cluster with:
databricks clusters start --cluster-id <CLUSTER_ID>
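The decision of whether the cluster needs to be started can be automated. The sketch below rests on an assumption not stated in this guide: that the Databricks CLI command `databricks clusters get --cluster-id <CLUSTER_ID>` prints a JSON document with a "state" field. The function name `needs_start` is hypothetical.

```python
import json


def needs_start(cluster_info_json):
    """Return True when the cluster JSON reports the TERMINATED state."""
    info = json.loads(cluster_info_json)
    # Assumption: the cluster description carries its status under "state".
    return info.get("state") == "TERMINATED"


if __name__ == "__main__":
    sample = '{"cluster_id": "example-id", "state": "TERMINATED"}'  # made-up data
    print(needs_start(sample))
```

In practice the JSON text would come from capturing the CLI output, for example with `subprocess.run(..., capture_output=True, text=True).stdout`.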
The installation script has a number of options that can also deal with different databricks-cli profiles, install a version of the mmlspark library, overwrite the libraries, or prepare the cluster for operationalization. For all options, please see:
python tools/databricks_install.py -h
Once you have confirmed the databricks cluster is RUNNING, install the modules within this repository with the following commands.
cd Recommenders
python tools/databricks_install.py <CLUSTER_ID>
Note: If you plan to run the sample code for operationalization here, you need to prepare the cluster for operationalization. You can do so by adding an additional option to the script run. <CLUSTER_ID> is the same as that mentioned above, and can be identified by running databricks clusters list and selecting the appropriate cluster.
python tools/databricks_install.py --prepare-o16n <CLUSTER_ID>
See below for details.
Manual setup
To install the repo manually onto Databricks, follow these steps:
- Clone the Microsoft Recommenders repository to your local computer.
- Zip the contents inside the Recommenders folder (Azure Databricks requires compressed folders to have the .egg suffix, so we don't use the standard .zip):
cd Recommenders
zip -r Recommenders.egg .
- Once your cluster has started, go to the Databricks workspace, and select the "Home" button.
- Your "Home" directory should appear in a panel. Right click within your directory, and select "Import".
- In the pop-up window, there is an option to import a library, where it says: "(To import a library, such as a jar or egg, click here)". Select "click here".
- In the next screen, select the option "Upload Python Egg or PyPI" in the first menu.
- Next, click on the box that contains the text "Drop library egg here to upload" and use the file selector to choose the Recommenders.egg file you just created, and select "Open".
- Click on "Create library". This will upload the egg and make it available in your workspace.
- Finally, in the next menu, attach the library to your cluster.
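On systems without the zip command, the egg from the second step can be built with the Python standard library. This is a sketch rather than the repository's method; `build_egg` is a hypothetical helper. It stores paths relative to the repository folder, which mirrors running zip from inside the Recommenders folder.

```python
import os
import zipfile


def build_egg(repo_dir, egg_name="Recommenders.egg"):
    """Zip the contents of repo_dir so archive paths are relative to it."""
    egg_path = os.path.join(repo_dir, egg_name)
    with zipfile.ZipFile(egg_path, "w", zipfile.ZIP_DEFLATED) as egg:
        for root, _dirs, files in os.walk(repo_dir):
            for filename in files:
                full_path = os.path.join(root, filename)
                if os.path.abspath(full_path) == os.path.abspath(egg_path):
                    continue  # do not add the archive to itself
                # Stored names start inside the folder,
                # e.g. reco_utils/__init__.py
                egg.write(full_path, os.path.relpath(full_path, repo_dir))
    return egg_path
```

Calling `build_egg("Recommenders")` from the parent directory would write Recommenders/Recommenders.egg.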
After installation, you can now create a new notebook and import the utilities from Databricks in order to confirm that the import worked.
import reco_utils
- For the reco_utils import to work on Databricks, it is important to zip the content correctly. The zip has to be performed inside the Recommenders folder; if you zip directly above the Recommenders folder, the import won't work.
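If the import fails, it can be useful to see whether Python can locate the package at all. The sketch below uses the standard library for that; `describe_package` is a hypothetical name, and the check only looks the package up without importing it.

```python
import importlib.util


def describe_package(name="reco_utils"):
    """Report whether a top-level package can be found, without importing it."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return name + " not found on the Python path"
    return name + " found at " + str(spec.origin)


if __name__ == "__main__":
    print(describe_package())
```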
This repository includes an end-to-end example notebook that uses Azure Databricks to estimate a recommendation model using matrix factorization with Alternating Least Squares, writes pre-computed recommendations to Azure Cosmos DB, and then creates a real-time scoring service that retrieves the recommendations from Cosmos DB. In order to execute that notebook, you must install the Recommenders repository as a library (as described above), AND you must also install some additional dependencies. With the Quick install method, you just need to pass an additional option to the installation script.
Quick install
This option utilizes the installation script to do the setup. Just run the installation script with an additional option. If you have already run the script once to upload and install the Recommenders.egg library, you can also add an --overwrite option:
python tools/databricks_install.py --overwrite --prepare-o16n <CLUSTER_ID>
This script does all of the steps described in the Manual setup section below.
Manual setup
You must install three packages as libraries from PyPI:
azure-cli==2.0.56
azureml-sdk[databricks]==1.0.8
pydocumentdb==2.3.3
You can follow instructions here for details on how to install packages from PyPI.
Additionally, you must install the spark-cosmosdb connector on the cluster. The easiest way to manually do that is to:
- Download the appropriate jar from MAVEN. NOTE: This is the appropriate jar for Spark versions 2.3.X, and is the appropriate version for the recommended Azure Databricks run-time detailed above.
- Upload and install the jar by:
  - Log into your Azure Databricks workspace.
  - Select the "Clusters" button on the left.
  - Select the cluster on which you want to import the library.
  - Select the "Upload" and "Jar" options, and click in the box that has the text "Drop JAR here" in it.
  - Navigate to the downloaded .jar file, select it, and click "Open".
  - Click on "Install".
  - Restart the cluster.
A setup.py file is provided in order to simplify the installation of the utilities in this repo from the main directory.
This still requires the conda environment to be installed as described above. Once the necessary dependencies are installed, you can use the following command to install reco_utils as a python package.
pip install -e .
It is also possible to install directly from GitHub, either from the default branch or from a specific branch:
pip install -e git+https://github.com/microsoft/recommenders/#egg=pkg
pip install -e git+https://github.com/microsoft/recommenders/@staging#egg=pkg
NOTE: The pip installation does not install any of the necessary package dependencies. It is expected that conda will be used as shown above to set up the environment for the utilities being used.
A Dockerfile is provided to build images of the repository to simplify setup for different environments. You will need Docker Engine installed on your system.
Note: docker is already available on the Azure Data Science Virtual Machine.
See the guidelines in the Docker README for detailed instructions on how to build and run images for different environments.
Example commands to build and run a Docker image with the base CPU environment:
DOCKER_BUILDKIT=1 docker build -t recommenders:cpu --build-arg ENV="cpu" .
docker run -p 8888:8888 -d recommenders:cpu
You can then open the Jupyter notebook server at http://localhost:8888
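Because the container is started in detached mode, the server may take a moment to come up. The following sketch polls the address once; `notebook_server_is_up` is a hypothetical helper, and it treats any HTTP response, including an error status, as a sign that the server is running.

```python
import urllib.error
import urllib.request


def notebook_server_is_up(url="http://localhost:8888", timeout=3):
    """Return True if an HTTP response of any status comes back from url."""
    # The server is local, so bypass any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server answered, even if with an error status
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or similar failure


if __name__ == "__main__":
    print("server reachable:", notebook_server_is_up())
```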