Synapse vs Databricks: A Comparison 

As a Data Platform Architect/Engineer working with several clients in Finland, I have extensive experience using Azure Databricks together with Azure Data Factory (for notebook orchestration). Recently, however, one of my clients decided to switch to Azure Synapse Analytics. In this post, I will share my journey of transitioning from Databricks to Synapse and provide insights that may help you make a more informed decision if you are considering either platform.

When choosing between Synapse and Databricks for your data processing needs, there are several factors to consider. First, we will take a closer look at some of the key features of each platform, and then I will give my opinion on the matter.

Data Storage, Resource Access, and DevOps Integration

When comparing Databricks and Synapse, it is important to consider the availability of certain features. For example, Databricks allows you to use multiple notebooks within the same session, a feature that is not currently available in Synapse. Another key difference is the way the two platforms handle data storage. Databricks provides a static mount path for your storage accounts, making it easy to navigate through your data like a traditional filesystem. In contrast, Synapse requires you to provide a job ID when reading data from a mount, and that ID changes every time a new job is run.
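To make the mount difference concrete, here is a minimal sketch in plain Python of how the two path styles could be built. The helper names, mount names, and the job ID value are hypothetical and chosen only for illustration. The `synfs:/<job id>/<mount point>/...` layout reflects my understanding of how Synapse exposes mounts, and in a real Synapse notebook the job ID would typically come from `mssparkutils.env.getJobId()` rather than being hard-coded.

```python
def databricks_mount_path(mount_name: str, relative_path: str) -> str:
    """Build a Databricks-style mount path.

    The path depends only on the mount name, so it stays the same
    across job runs (assumes the storage is mounted under /mnt).
    """
    return f"/mnt/{mount_name.strip('/')}/{relative_path.lstrip('/')}"


def synapse_mount_path(job_id: str, mount_point: str, relative_path: str) -> str:
    """Build a Synapse-style mount path.

    The job ID is part of the path, so the full path changes
    whenever a new job is run.
    """
    return f"synfs:/{job_id}/{mount_point.strip('/')}/{relative_path.lstrip('/')}"


if __name__ == "__main__":
    # Hypothetical values. In a Synapse notebook the job ID would normally be
    # read at runtime (for example via mssparkutils.env.getJobId()).
    print(databricks_mount_path("raw", "sales/2023.csv"))
    print(synapse_mount_path("42", "/raw", "sales/2023.csv"))
    print(synapse_mount_path("43", "/raw", "sales/2023.csv"))
```

The practical consequence is that a Databricks path can live in configuration as a constant, while a Synapse path has to be assembled at runtime inside each job.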

When it comes to accessing resources, Synapse offers linked service access management, which allows for cleaner and more manageable connections between different Azure services. In contrast, Databricks relies on tokens generated by service principals for resource access. Databricks does have an advantage in boot-up time, as its clusters start faster than those in Synapse. On the other hand, Synapse has better DevOps integration than Databricks.

Features, Performance and Use Cases

There are several other key differences between Databricks and Synapse that are worth considering. For example, Databricks currently offers more features and better performance optimizations than Synapse. However, for data platforms that primarily use SQL and have few Spark use cases, Synapse Analytics may be the better choice. Synapse has an open-source version of Spark with built-in support for .NET applications, while Databricks has an optimized version of Spark that offers increased performance. Additionally, Databricks allows users to select GPU-enabled clusters for faster data processing and higher concurrency.

User Experience

In terms of user experience, Synapse has a traditional SQL engine that may feel more familiar to BI developers. It also has a Spark engine for use by data scientists and analysts. In contrast, Databricks is a Spark-based notebook tool with a focus on Spark functionality. For metadata, Synapse currently offers only a Hive metadata GUI, whereas Databricks, with Unity Catalog, takes this to another level by letting you build a metadata hierarchy.

Managing Workflows with External Orchestration Tools

One important aspect to understand when using notebooks in Databricks is the lack of a built-in orchestration tool or service. While it is possible to schedule jobs in Databricks, the functionality is quite basic. For this reason, in many projects we used Azure Data Factory (ADF) to orchestrate Databricks notebooks. At a recent Databricks meetup, one participant mentioned using Apache Airflow for orchestration on AWS, though I am not sure about GCP. This is a crucial point to consider because Synapse bundles everything under one umbrella for seamless integration. Until Databricks produces an alternative solution, you will need to use it alongside ADF or Synapse for orchestration.

| Feature | Databricks | Azure Synapse Analytics |
| --- | --- | --- |
| Multiple notebooks within same session | Yes | No |
| Data storage handling | Static mount path for storage accounts | Requires ‘job id’ when reading data from a mount |
| Resource access management | Tokens generated by service principals | Linked Service access management |
| Bootup time | Faster speeds than Synapse | Slower speeds than Databricks |
| DevOps integration | Less integration compared to Synapse | Better integration compared to Databricks |
| Features and performance optimizations | More features and better performance optimizations than Synapse | Fewer features and fewer performance optimizations than Databricks |
| SQL support | Less support for SQL use cases | Better support for SQL use cases |
| Spark engine | Optimized version of Spark that offers increased performance | Open-source version of Spark with built-in support for .NET applications |
| GPU-enabled clusters | Allows users to select GPU-enabled clusters for faster data processing and higher concurrency | Not available in Synapse now |
| User experience | Spark-based notebook tool with a focus on Spark functionality | Traditional SQL engine that may feel more familiar to BI developers. Also has a Spark engine for use by data scientists and analysts. |
| Real-time co-authoring | Databricks Notebooks has real-time co-authoring (both authors see the changes in real time) | Synapse Notebooks has co-authoring, but one person needs to save the notebook before another person sees the change |
| Orchestration tool or service | Lacks an in-built orchestration tool or service. Needs to be used alongside ADF or Synapse for orchestration. | Bundles everything under one umbrella for seamless integration |

Table: Synapse vs Databricks feature comparison summary.

Choosing Between Databricks and Synapse: Which One Is Right for You?

Ultimately, the choice between these two platforms will depend on your specific needs and priorities. Nah! I will not leave you with a diplomatic answer. In my opinion (which could be controversial depending on your cloud bias and when you are reading this), if your infrastructure is on AWS or GCP and your priorities are data processing efficiency and access to the latest Spark and Delta features, go for Databricks.

On the other hand, if your infrastructure is primarily based on Azure and your use case involves data preparation for a data platform with data modeling on a data lake (reach out if you are interested to know how), then Azure Synapse may be the better choice. Synapse has more features in development for future releases, something that has not been announced by Databricks yet. Good luck! And stay tuned for an upcoming series focusing on ML, streaming, Delta, and partitioning.
