Jupyter Notebook Deployment — No more Jupyter(.ipynb) to Python(.py)

Prakash Gupta
5 min read · Feb 24, 2021
Deploy Jupyter notebooks directly as Serverless Cloud Functions

Jupyter Notebooks are arguably one of the most popular tools used by Data Engineers and Data Scientists worldwide. Data ETL, machine learning training, experimentation, model testing, model inference: all of it can be done from the Jupyter Notebook interface itself. Notebooks are also excellent for generating visual reports, dashboards and trained ML models. But while Jupyter Notebook is an awesome IDE for these tasks, it is not easy to plug notebooks into an automated pipeline that runs them on a recurring basis, even though reports, dashboards and ML models all need a regular refresh as new data arrives.

People often resort to converting their Jupyter notebook (.ipynb) files into Python script (.py) files, which can then be deployed in a pipeline and invoked programmatically on a schedule. Apart from the developer effort needed for the conversion, one huge drawback of converting .ipynb files to .py scripts is the need to maintain and manage duplicate code bases.

Papermill solves this problem by allowing you to run a Jupyter Notebook (.ipynb) file as if it were a Python script (.py). Netflix is a contributor to the project and a big promoter of the idea of using Jupyter Notebooks in ETL and data pipelines. Papermill supports notebook parameterization, which lets us override the value of any variable used inside the notebook at invocation time. This opens up a whole new way of running automated ETL jobs and ML training, where the output notebook becomes a one-stop immutable record of the cron job, with the report, dashboard, logs and error messages all in one place.
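For example, outside of Clouderizer, a parameterized notebook can be executed from plain Python with Papermill itself. This is a minimal sketch; the notebook name and parameter value are placeholders:

import papermill as pm

# Run the notebook top to bottom, overriding the tagged parameters cell,
# and save the fully executed notebook (outputs, logs, errors) as a new file
pm.execute_notebook(
    "etl.ipynb",
    "etl_output_2021_02_24.ipynb",
    parameters={"input_dataset_url": "https://mys3bucket.s3.amazonaws.com/dataset.zip"},
)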

Clouderizer supports deploying Jupyter Notebooks as serverless functions using Papermill. There is no need to convert your .ipynb file to Python. Any Jupyter Notebook can be deployed to a scalable serverless infrastructure with just one CLI command, in under 2 minutes!

  • No code changes are needed in your Jupyter Notebook.
  • No need to build any docker container. Just give us your notebook.
  • High Memory and Compute support for heavy ETL notebooks.
  • GPU support for Deep Learning notebooks (Nvidia Tesla T4)
  • Automatic dependency detection and installation using pipreqs.
  • Long running operations (default time limit is 60 minutes, can be increased on request)
  • Inputs — Pass parameters to the Jupyter Notebook serverless function (using Papermill parameterization). Parameters can be text inputs, URLs or file inputs (up to 1 GB).
  • Outputs — multiple kinds of notebook outputs are supported.
    Text output (raw text, JSON, XML, etc.) — Just like inputs, we can tag any cell in the notebook as output. Text output from this cell is returned in the HTTP response.
    File output — In case your model generates any file(s), make sure you put them all in a single directory and use the variable outputDir to refer to this directory path. Clouderizer will save all files generated in outputDir during execution to an S3 bucket and return the S3 paths of these generated files in the HTTP response (see the sketch after this list).
  • Asynchronous invocation of serverless functions. Specify a callback URL in your request to be notified once execution completes.
  • View a detailed invocation history for your deployment. The inputs (text or files), outputs (text or files) and output notebook corresponding to each invocation can be viewed and downloaded from the Clouderizer console.
  • Forever free account for non-commercial use with 100 minutes of execution time per month.
  • Pay only when you invoke and execute your function.
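To make the file-output convention concrete, here is a minimal sketch of a notebook cell that writes its artifacts into outputDir. Everything except the outputDir convention itself (the folder name, the report contents) is a made-up example:

import os
import pandas as pd

# The parameters cell would define outputDir (tagged "parameters" so it can be overridden)
outputDir = "./output"
os.makedirs(outputDir, exist_ok=True)

# Any file saved under outputDir is uploaded to S3 by Clouderizer after execution,
# and its S3 path is returned in the HTTP response
report = pd.DataFrame({"metric": ["rows_processed"], "value": [12345]})
report.to_csv(os.path.join(outputDir, "daily_report.csv"), index=False)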

*Note: Only Python notebook support is in production right now. R notebook support is in beta. Please contact us if you want early access to it. If you have requirements for other kernels, please send us your request at info@clouderizer.com.

Examples

Pre-requisites:

  • Active Clouderizer account. Sign up for a free account here if you don’t have one.
  • Install the Clouderizer CLI and log in. Read here for detailed instructions.

Example 1:

Deploy a Python ETL notebook (etl.ipynb) to Clouderizer as a serverless function. The notebook takes one S3 URL as input to load the data. This input parameter is held in a notebook variable named input_dataset_url.
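For reference, here is a minimal sketch of what the parameters cell described in step 1 below could contain. The default value is just a placeholder and is overridden at invocation time:

# Cell tagged "parameters": Papermill/Clouderizer override this default when the function is invoked
input_dataset_url = "https://mys3bucket.s3.amazonaws.com/dataset.zip"

# A later cell uses the parameter like any ordinary variable, for example:
import urllib.request
urllib.request.urlretrieve(input_dataset_url, "dataset.zip")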

  1. Parameterize your notebook for the input S3 URL as per Papermill guidelines (basically, make sure the S3 URL variable is in a single cell and tag that cell as parameters).
  2. Type the following command in a terminal window:
     cldz deploy -n python etl.ipynb
  3. Clouderizer CLI will try to auto-detect dependencies from your notebook and show you a preview. If the dependencies look OK, press y to continue. If you feel something is missing, you can compile the list of dependencies in a requirements.txt and deploy again with:
     cldz deploy -n python etl.ipynb requirements.txt
  4. That’s it. This deploys your notebook as a serverless function. Packaging and deployment might take a few minutes depending on the dependencies. Once deployment is complete, you will get two URLs for your serverless function:
     Sync URL — use this to invoke the function when you know the notebook takes < 1 min to execute.
     Async URL — use this to invoke the function when the notebook takes longer to execute.
  5. Example invocation using curl:

curl -i -X POST -F "input_dataset_url=https://mys3bucket.s3.amazonaws.com/dataset.zip?1234" https://showcase.clouderizer.com/api/async-function/clouderizerdemo-etl/notebook

The above is an async invocation. It returns immediately with a 202 Accepted HTTP response code. You can log in to the Clouderizer console to see the progress of this request. Once the function execution is complete, the output notebook is available for viewing and download.
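If you would rather invoke the function from Python than from curl, a rough equivalent with the requests library looks like this (same endpoint and form field as the curl call above):

import requests

# Async invocation: multipart form field, mirroring curl's -F flag
resp = requests.post(
    "https://showcase.clouderizer.com/api/async-function/clouderizerdemo-etl/notebook",
    files={"input_dataset_url": (None, "https://mys3bucket.s3.amazonaws.com/dataset.zip?1234")},
)
print(resp.status_code)  # expect 202 Accepted for async invocations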

Example 2:

Deploy a TensorFlow deep learning notebook (tf_deeplearning.ipynb) as a serverless function with GPU support. The deployment takes an S3 URL for the input dataset and an integer for the batch size. Both inputs are defined as variables in the notebook, named input_dataset_url and batch_size. The notebook also generates a model file, exported to the local folder path held in the variable outputDir.
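As a rough sketch (the defaults are placeholders and the real training code is omitted, replaced by a trivial placeholder model), the parameters cell and the final export cell of such a notebook could look like this:

import os
import tensorflow as tf

# Cell tagged "parameters": all defaults below are overridden at invocation time
input_dataset_url = "https://mys3bucket.s3.amazonaws.com/dataset.zip"
batch_size = 32
outputDir = "./output"

# (Training cells would download the dataset and fit the model with batch_size;
# here we just build a trivial model so the export step below is concrete.)
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])

# Final cell: export the trained model into outputDir so Clouderizer uploads it
# to S3 and returns its S3 path in the callback/response
os.makedirs(outputDir, exist_ok=True)
model.save(os.path.join(outputDir, "model.h5"))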

  1. Parameterize your notebook for the inputs as per Papermill guidelines (basically, make sure all input variables are in a single cell and tag that cell as parameters).
  2. Type the following command in a terminal window (note the infra option for GPU deployment):
     cldz deploy -n python tf_deeplearning.ipynb --infra gpu
  3. Clouderizer CLI will try to auto-detect dependencies from your notebook and show you a preview. If the dependencies look OK, press y to continue. If you feel something is missing, you can compile the list of dependencies in a requirements.txt and deploy again with:
     cldz deploy -n python tf_deeplearning.ipynb requirements.txt --infra gpu
  4. That’s it. This deploys your notebook as a serverless function. Packaging and deployment might take a few minutes depending on the dependencies. Once deployment is complete, you will get two URLs for your serverless function:
     Sync URL — use this to invoke the function when you know the notebook takes < 1 min to execute.
     Async URL — use this to invoke the function when the notebook takes longer to execute.
  5. Example invocation using curl:

curl -i -X POST -F "input_dataset_url=https://mys3bucket.s3.amazonaws.com/dataset.zip?1234" -F batch_size=128 https://showcase.clouderizer.com/api/async-function/clouderizerdemo-tf-deeplearning/notebook -H "X-Callback-Url: https://mywebhook.com"

*Note the callback URL provided in the HTTP header.

The above is an async invocation. It returns immediately with a 202 Accepted HTTP response code. Once execution is complete, the callback URL specified in the request is called with the HTTP result. This result will contain the S3 URL of the model file generated during the notebook invocation.
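To close the loop, a minimal callback receiver could look like the sketch below. It assumes the callback arrives as an HTTP POST; the exact payload fields are not documented in this post, so the handler simply logs whatever it receives:

from flask import Flask, request

app = Flask(__name__)

# Hypothetical endpoint registered as X-Callback-Url (e.g. https://mywebhook.com)
@app.route("/", methods=["POST"])
def clouderizer_callback():
    payload = request.get_json(silent=True) or request.form.to_dict()
    # Expected to include the S3 URL(s) of files written to outputDir;
    # field names are unknown here, so log the whole payload
    print("Clouderizer callback:", payload)
    return "", 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)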
