Synapse vs Databricks: A Comparison
As a Data Platform Architect/Engineer working with several clients in Finland, I have extensive experience using Azure Databricks and Azure Data Factory (for notebook orchestration). Recently, however, one of my clients decided to switch to Azure Synapse Analytics. In this post, I will share what I learned while moving from Databricks to Synapse, in the hope that it helps you make a more informed decision if you are considering either platform.
There are several factors to consider when choosing between Synapse and Databricks for your data processing needs. We will first take a closer look at some key features of each platform, and then I will give my own opinion on the matter.
Data Storage, Resource Access, and DevOps Integration
When comparing Databricks and Synapse, it is important to consider the availability of certain features. For example, Databricks allows you to use multiple notebooks within the same session – a feature that is not currently available in Synapse. Another key difference between the two platforms is the way they handle data storage. Databricks provides a static mount path for your storage accounts, making it easy to navigate through your data like a traditional filesystem. In contrast, Synapse requires you to provide a ‘job id’ when reading data from a mount – an id that changes every time a new job is run.
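To make the mount difference concrete, here is a minimal sketch in Python. The helper names, the mount name `raw`, and the file path are hypothetical, and the job ids are placeholders. In a Synapse notebook the job id would normally come from `mssparkutils.env.getJobId()`; here it is passed in as a plain string so the snippet runs outside a notebook.

```python
def databricks_mount_path(mount_name: str, relative_path: str) -> str:
    """Databricks: a mount sits under a fixed /mnt/<name> prefix."""
    return f"/mnt/{mount_name.strip('/')}/{relative_path.lstrip('/')}"


def synapse_mount_path(job_id: str, mount_name: str, relative_path: str) -> str:
    """Synapse: the mount is addressed via the synfs scheme plus the current job id."""
    return f"synfs:/{job_id}/{mount_name.strip('/')}/{relative_path.lstrip('/')}"


# The Databricks path is the same on every run.
print(databricks_mount_path("raw", "sales/orders.parquet"))

# The Synapse path depends on the job id, so it differs between runs.
# In a Synapse notebook (assumption): job_id = mssparkutils.env.getJobId()
print(synapse_mount_path("49", "raw", "sales/orders.parquet"))
print(synapse_mount_path("50", "raw", "sales/orders.parquet"))
```

In practice this means a Synapse notebook has to rebuild its paths at the start of each run rather than hard-coding them.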
When it comes to accessing resources, Synapse offers linked service access management, which allows for cleaner and more manageable connections between different Azure services. In contrast, Databricks relies on tokens generated by service principals for resource access. Databricks does have an advantage in startup time: its clusters boot faster than Synapse Spark pools. On the other hand, Synapse has better DevOps integration than Databricks.
Features, Performance and Use Cases
There are several other key differences between Databricks and Synapse that are worth considering. Databricks currently offers more features and better performance optimizations than Synapse. However, for data platforms that primarily use SQL and have few Spark use cases, Synapse Analytics may be the better choice. Synapse runs the open-source version of Spark with built-in support for .NET applications, while Databricks runs an optimized version of Spark that offers increased performance. Additionally, Databricks allows users to select GPU-enabled clusters for faster data processing and higher concurrency.
User Experience
In terms of user experience, Synapse has a traditional SQL engine that may feel more familiar to BI developers. It also has a Spark engine for use by data scientists and analysts. In contrast, Databricks is a Spark-based notebook tool with a focus on Spark functionality. Synapse currently offers only a GUI for Hive metadata, whereas Databricks, with Unity Catalog, takes the metadata hierarchy a level further.
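One way to picture the extra level of hierarchy: a Hive metastore addresses a table as `schema.table`, while Unity Catalog adds a catalog on top, giving `catalog.schema.table`. The sketch below is plain Python for illustration only. `TableName`, `parse_table_name`, and the example names (`main`, `sales`, `orders`) are hypothetical and not part of either product's API.

```python
from typing import NamedTuple, Optional


class TableName(NamedTuple):
    catalog: Optional[str]  # None for a two-level (Hive-style) name
    schema: str
    table: str


def parse_table_name(name: str) -> TableName:
    """Split a dotted table name into its two or three levels."""
    parts = name.split(".")
    if len(parts) == 2:
        return TableName(None, parts[0], parts[1])
    if len(parts) == 3:
        return TableName(parts[0], parts[1], parts[2])
    raise ValueError(f"expected schema.table or catalog.schema.table, got {name!r}")


# Hive-style, two levels
print(parse_table_name("sales.orders"))
# Unity Catalog style, three levels
print(parse_table_name("main.sales.orders"))
```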
Managing Workflows with External Orchestration Tools
One important aspect to understand when using notebooks in Databricks is the lack of a built-in orchestration tool or service. While it is possible to schedule jobs in Databricks, the functionality is quite basic. For this reason, in many projects we used Azure Data Factory (ADF) to orchestrate Databricks notebooks. At a recent Databricks meetup, one participant mentioned using Apache Airflow for orchestration on AWS, though I am not sure about GCP. This is a crucial point to consider because Synapse bundles everything under one umbrella for seamless integration. Until Databricks offers an alternative, you will need to use it alongside ADF or Synapse for orchestration.
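As a rough illustration of what orchestrating Databricks notebooks from ADF looks like, the sketch below builds a two-step pipeline definition as a Python dictionary and prints it as JSON. It follows the general shape of an ADF pipeline with `DatabricksNotebook` activities as I understand it. The pipeline name, activity names, notebook paths, and the linked service name `ls_databricks` are all hypothetical, so check the result against the ADF documentation before using anything like it.

```python
import json
from typing import Dict, List, Optional


def notebook_activity(
    name: str,
    notebook_path: str,
    linked_service: str,
    depends_on: Optional[List[str]] = None,
    parameters: Optional[Dict[str, str]] = None,
) -> dict:
    """Build one Databricks notebook activity that runs after its dependencies succeed."""
    return {
        "name": name,
        "type": "DatabricksNotebook",
        "dependsOn": [
            {"activity": dep, "dependencyConditions": ["Succeeded"]}
            for dep in (depends_on or [])
        ],
        "linkedServiceName": {
            "referenceName": linked_service,
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "notebookPath": notebook_path,
            "baseParameters": parameters or {},
        },
    }


# Hypothetical two-step pipeline: ingest first, then build the curated layer.
activities = [
    notebook_activity(
        "ingest_raw", "/pipelines/ingest_raw", "ls_databricks",
        parameters={"run_date": "2023-01-01"},
    ),
    notebook_activity(
        "build_curated", "/pipelines/build_curated", "ls_databricks",
        depends_on=["ingest_raw"],
    ),
]
pipeline = {"name": "daily_load", "properties": {"activities": activities}}

print(json.dumps(pipeline, indent=2))
```

The point is that the ordering between notebooks lives in the orchestrator (the `dependsOn` entries), not in the notebooks themselves.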
Summary of the Comparison
Feature | Databricks | Azure Synapse Analytics |
Multiple notebooks within same session | Yes | No |
Data storage handling | Static mount path for storage accounts | Requires ‘job id’ when reading data from a mount |
Resource access management | Tokens generated by service principals | Linked Service access management |
Bootup time | Faster than Synapse | Slower than Databricks |
DevOps integration | Weaker than Synapse | Better than Databricks |
Features and performance optimizations | More features and better performance optimizations | Fewer features and fewer performance optimizations |
SQL support | Less support for SQL use cases | Better support for SQL use cases |
Spark engine | Optimized version of Spark that offers increased performance | Open-source version of Spark with built-in support for .NET applications |
GPU-enabled clusters | Available; users can select GPU-enabled clusters for faster data processing and higher concurrency | Not currently available |
User experience | Spark-based notebook tool with a focus on Spark functionality | Traditional SQL engine that may feel more familiar to BI developers. Also has a Spark engine for use by data scientists and analysts. |
Real-time co-authoring | Real-time co-authoring: both authors see changes as they are made | Co-authoring is supported, but one person must save the notebook before the other sees the change |
Orchestration tool or service | Lacks an in-built orchestration tool or service. Needs to be used alongside ADF or Synapse for orchestration. | Bundles everything under one umbrella for seamless integration. |
Choosing Between Databricks and Synapse: Which One Is Right for You?
Ultimately, the choice between these two platforms will depend on your specific needs and priorities. Nah! I will not leave you with a diplomatic answer. In my opinion (which could be controversial, depending on your cloud bias and when you are reading this), if your infrastructure is on AWS or GCP and your priorities are data processing efficiency and access to the latest Spark and Delta features, go for Databricks.
On the other hand, if your infrastructure is primarily on Azure and your use case involves data preparation for a data platform with data modeling on a data lake (reach out if you are interested to know how), then Azure Synapse may be the better choice. Synapse also has more features in development for future releases, something Databricks has not yet announced. Good luck! And stay tuned for an upcoming series focusing on ML, streaming, Delta, and partitioning.