Databricks Project Workspaces Overview
Audience: Project members
Content Summary: This page explains Databricks workspaces, which allow users to access and write to protected data directly in Databricks.
See the Pre-Configuration Checklist for details on prerequisites and see the Configuration page for installation instructions.
Overview
Databricks project workspaces allow users to access data on a cluster without going through the Immuta SparkSession. Built on Immuta Projects and Project Equalization, a Databricks project workspace is a space where every project member has the same level of access to data. This equalized access allows collaboration without concern about data leaks. Project members can not only collaborate on data, but also write protected data back to Immuta.
Users can access the directory and database created for the workspace only while acting under the project. The Immuta Spark SQL session applies policies to the data, so any data written to the workspace is already compliant with the restrictions of the equalized project, where all members see data at the same level of access. When users are ready to write data back to Immuta, they should use the SparkSQL session to copy data into the workspace.
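As an illustration, copying a policy-applied table into the workspace database could look like the following sketch. The database and table names (my_project_workspace, immuta.customer_data) are hypothetical placeholders, not values Immuta creates for you; the actual workspace database name comes from your project configuration.

```python
# Hypothetical names: substitute your project's workspace database and a
# table exposed through Immuta.
workspace_db = "my_project_workspace"   # database created for the workspace
source_table = "immuta.customer_data"   # policy-applied source table

# Build a CREATE TABLE AS SELECT statement; because the SparkSQL session
# applies project policies on read, the copied data is already equalized.
ctas = (
    f"CREATE TABLE {workspace_db}.customer_data_copy "
    f"USING delta AS SELECT * FROM {source_table}"
)

# On a Databricks cluster you would execute it with the Immuta SparkSQL session:
# spark.sql(ctas)
print(ctas)
```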
Supported Cloud Providers
Amazon Web Services
Immuta currently supports the s3a schema for Amazon S3. When using Databricks on Amazon S3, either an S3 key pair with access to the workspace bucket/prefix must be specified in the additional configuration, or an instance role with that access must be applied to the cluster.
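If a key pair is used rather than an instance role, the credentials are typically supplied as the standard Hadoop s3a properties shown below. This is a sketch with placeholder values; the exact shape of the additional configuration file depends on your Immuta deployment.

```
# Placeholder credentials -- supply a key pair with access to the
# workspace bucket/prefix.
spark.hadoop.fs.s3a.access.key <ACCESS_KEY_ID>
spark.hadoop.fs.s3a.secret.key <SECRET_ACCESS_KEY>
```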
Microsoft Azure
Immuta currently supports the abfss schema for Azure General Purpose V2 Storage Accounts. This includes support for Azure Data Lake Storage Gen2. When configuring Immuta workspaces for Databricks on Azure, the Azure Databricks Workspace ID must be provided. More information about how to determine the Workspace ID for your workspace can be found in the Databricks documentation. It is also important that the additional configuration file is included on any cluster that will use Immuta workspaces, with credentials for the container in Azure Storage that contains Immuta workspaces.
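For the abfss schema, the container credentials are commonly supplied as the standard Hadoop ABFS account-key property shown below. This is a sketch with placeholder values; the storage account name and key are assumptions you must replace with your own.

```
# Placeholder values -- credentials for the storage account whose container
# holds Immuta workspaces.
spark.hadoop.fs.azure.account.key.<STORAGE_ACCOUNT>.dfs.core.windows.net <ACCOUNT_KEY>
```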
Google Cloud Platform
Immuta currently supports the gs schema for Google Cloud Platform. The primary difference between Databricks on Google Cloud Platform and Databricks on AWS or Azure is that it is deployed to Google Kubernetes Engine. Databricks automatically provisions and autoscales drivers and executors as pods on Google Kubernetes Engine, so Google Cloud Platform admin users can view and monitor these Kubernetes resources in Google Cloud Platform.
Caveats and Limitations
- Stage Immuta installation artifacts in Google Storage, not DBFS: The DBFS FUSE mount is unavailable, and the IMMUTA_SPARK_DATABRICKS_DBFS_MOUNT_ENABLED property cannot be set to true to expose the DBFS FUSE mount.
- Stage the Immuta init script in Google Storage: Init scripts in DBFS are not supported.
- Stage third-party libraries in DBFS: Installing libraries from Google Storage is not supported.
- Install third-party libraries as cluster-scoped: Notebook-scoped libraries have limited support. See the Databricks Libraries page for more details.
- Maven library installation is only supported in Databricks Runtime 8.1+.
- /databricks/spark/conf/spark-env.sh is mounted as read-only:
  - Set sensitive Immuta configuration values directly in immuta_conf.xml: Do not use environment variables to set sensitive Immuta properties. Because the spark-env.sh file is read-only, Immuta cannot edit it; remove these environment variables yourself to keep them from being visible to end users.
  - Use /immuta-scratch directly: The IMMUTA_LOCAL_SCRATCH_DIR property is unavailable.
- Allow the Kubernetes resource to spin down before submitting another job: Job clusters with init scripts fail on subsequent runs.
- The DBFS CLI is unavailable: Other non-DBFS Databricks CLI functions will still work as expected.
Writing Data Back to Databricks: Supported Metastore Providers
To write data back to a table in Databricks through an Immuta workspace, use one of the following supported provider types for your table format:
avro
csv
delta
orc
parquet
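As a sketch of how the provider type constrains a write-back, the helper below validates the format before building the DataFrame write call. The function name, table name, and validation logic are illustrative assumptions, not part of Immuta's API; only the set of supported provider types comes from the list above.

```python
# Provider types supported for workspace write-back, per the list above.
SUPPORTED_FORMATS = {"avro", "csv", "delta", "orc", "parquet"}

def workspace_write_plan(fmt: str, table: str) -> str:
    """Return the DataFrame write call for a supported provider type.

    Hypothetical helper: it only validates the format and builds the call
    as a string so the sketch can run without a Spark cluster.
    """
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported provider type: {fmt}")
    # On a Databricks cluster you would run the call against a real DataFrame:
    # df.write.format(fmt).saveAsTable(table)
    return f"df.write.format('{fmt}').saveAsTable('{table}')"

print(workspace_write_plan("delta", "my_project_workspace.results"))
```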