Databricks Inc.

09/17/2025 | News release | Archived content

Introducing Spatial SQL in Databricks: 80+ Functions for High-Performance Geospatial Analytics

Every day, billions of data points are tied to places on the map. Delivery routes, store visits, road networks, cell towers, and crop fields all carry important context for business decisions. The problem is that analyzing this data at scale has been hard. Legacy spatial systems are slow, require manual indexing, and often lock information into proprietary formats.

Today, we introduce Spatial SQL in Databricks. Spatial SQL brings geospatial analysis directly to the Databricks Platform. You can now work with native GEOMETRY and GEOGRAPHY data types, use more than 80 SQL functions, and run high-performance spatial joins, all while keeping your data open and ready for scale.

Location data plays a role in nearly every industry, and Spatial SQL makes it easier to use that information.

Here are some examples:

  • Retail operations can understand where their customers come from by analyzing areas and foot traffic
  • Transportation analysts can improve safety and customer experience by analyzing vehicle incidents and cellular network connectivity
  • Energy companies can optimize crew deployment during outages and find ideal sites for wind and solar farms
  • Agricultural operators can apply precision agriculture techniques to lower costs and improve yield efficiency
  • Insurance analysts can understand risk by analyzing policyholder addresses across flood, fire, and hurricane zones
  • Healthcare organizations can compare and predict health outcomes by analyzing environmental factors across geographies
  • And much more!

Spatial SQL is already helping customers improve performance and lower costs:

"Databricks Spatial SQL has redefined how we run large-scale spatial joins. By integrating Spatial SQL functions into our processing pipelines, we've seen over 20X faster performance and more than 50% lower costs on the same workloads. This breakthrough makes it possible to integrate and deliver rich road network data at a scale and speed that simply wasn't feasible before."

- Laxmi Duddu, Sr. Manager, Autonomy Data Platform & Analytics, Rivian Automotive

Customers have previously struggled to manage and scale spatial workloads with legacy systems and third-party libraries, or by resorting to manual indexing strategies. With Spatial SQL, customers get out-of-the-box simplicity and scalability.

"Spatial SQL lets us scale geospatial ETL like never before. Instead of overloading PostGIS servers with heavy queries, we shift the load to Databricks and take advantage of distributed processing, fast spatial joins, and efficient handling of vector data. It's a more efficient, resilient, and scalable approach for handling large and complex geospatial datasets."

- Pierre Chenaux, Tech Leader of Geospatial department, TotalEnergies

A major driver of performance is support for first-class geospatial data types. Instead of storing geo data in string, binary, or decimal columns, you can now use native GEOMETRY and GEOGRAPHY data types. These types include bounding box statistics that Databricks uses during query execution to skip irrelevant data and speed up joins. Spatial SQL also provides import functions for standard formats such as Well-Known Text (WKT), Well-Known Binary (WKB), GeoJSON, and simple latitude and longitude values.

These data types are completely open in Parquet, Iceberg, and Delta. The Databricks team has contributed to shaping the proposed specifications, ensuring there is no lock-in with proprietary warehouses. With the approved Apache Spark™ SPIP, GEOMETRY and GEOGRAPHY will soon be first-class data types in the open source engine as well.

What can you do with Spatial SQL?

Spatial SQL is more than a set of new functions. It gives you the building blocks to manage the full journey of spatial data, from storage and import to analysis and transformation. By working with native data types and efficient operations, you can bring location into everyday queries without adding complexity.

Here are some of the core things you can do:

  • Store spatial data natively with GEOMETRY and GEOGRAPHY
  • Import and export in formats such as WKT, WKB, GeoJSON, and GeoHash
  • Build new objects with constructors like ST_Point or ST_MakeLine
  • Calculate measurements using functions like ST_Distance and ST_Area
  • Perform spatial joins using relationships such as ST_Contains and ST_Intersects
  • Transform between coordinate systems with ST_Transform
  • Edit, validate, and combine spatial objects using ST_IsValid or ST_Union_Agg
  • And much more!

These features give you a complete toolkit for spatial analysis directly in SQL, with Python and Scala APIs also available. When you put them together, they unlock real workflows that matter in practice, which we will walk through in the next section.

Spatial SQL Examples in practice

Geospatial data is everywhere and growing. GPS traces with latitudes and longitudes are emitted from an increasing number of devices, sensors, and vehicles every second of the day. The world is being cataloged and updated constantly, with places, roads, networks, and boundaries modeled as points, lines, and polygons. Across every industry - retail, transportation and logistics, energy, climate and natural sciences, agriculture, public sector, financial services, real estate, insurance, telecommunications - location matters to every decision-maker who needs to understand the "where" in their data.

We have crafted four short examples to get you started with the new spatial data types and expressions, with the following goals:

  1. Prepare data for efficient processing by using the new GEOMETRY type
  2. Perform data enrichment by combining two spatial datasets using a spatial join
  3. Transform data into an appropriate spatial reference system to improve the accuracy of distance measurements
  4. Measure the distance between two cities

In these examples, we will use addresses, buildings, and divisions datasets from OvertureMaps.org. These datasets are offered for download in various formats, such as GeoParquet.

Overture Maps datasets visualized in Databricks Notebook with kepler.gl.

1. Creating a GEOMETRY column

The first step before performing any spatial analysis is to convert your data to use GEOMETRY or GEOGRAPHY data types. After downloading the Overture Maps data, we simply need to create a native GEOMETRY column from the provided WKB geometry column and drop other unnecessary columns like bbox. A bounding box is the smallest rectangle that contains a geometry. In spatial queries, bounding boxes speed queries up by quickly discarding data that can't possibly overlap: if two bounding boxes don't intersect, the geometries inside them definitely don't, so the database can skip the expensive intersection check and reduce the amount of data being processed. We don't need the bbox field because this information is now managed in the column statistics. For these datasets, addresses are POINT geometries, while buildings and divisions are POLYGON or MULTIPOLYGON geometries. Here is the initially downloaded buildings data, showing the first five columns.

This data can be easily converted into a Lakehouse Table with native GEOMETRY using ST_GeomFromWKB, shown in the example below for buildings. We know our data is in WGS84 (EPSG:4326), so we specify that in the creation of the spatial type. An SRID identifies the coordinate system of your spatial data, which defines the units (like degrees or meters) used in calculations such as distance and area. You must set a valid SRID when creating a geometry column, or the query will return an error. Also, note that our native types display in a human-friendly format (EWKT).
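A rough sketch of that conversion is shown below; the table and column names are placeholders rather than the exact Overture schema, and we pass the SRID (4326) when constructing the geometry, as described above:

    -- Hypothetical table/column names; geometry holds the Overture WKB column
    CREATE OR REPLACE TABLE overture.buildings AS
    SELECT
      id,
      ST_GeomFromWKB(geometry, 4326) AS geom,   -- native GEOMETRY in WGS84 (EPSG:4326)
      * EXCEPT (id, geometry, bbox)             -- drop the raw WKB and the bbox column
    FROM overture.buildings_raw;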

In addition to WKB, spatial data can also be directly imported into our native types from the most common interchange formats:

  • Latitude and longitude coordinates using ST_Point
  • WKT using ST_GeomFromWKT or ST_GeomFromText
  • WKB using ST_GeomFromWKB or ST_GeomFromBinary
  • GeoJSON using ST_GeomFromGeoJSON
  • GeoHash using ST_PointFromGeoHash

Similarly, spatial data can be exported as a number of formats:

  • WKT using ST_AsWKT or ST_AsText
  • WKB using ST_AsWKB or ST_AsBinary
  • GeoJSON using ST_AsGeoJSON
  • Extended WKT using ST_AsEWKT
  • Extended WKB using ST_AsEWKB
  • GeoHash using ST_GeoHash

Note: We also have import and export expressions for GEOGRAPHY types.
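To make the format conversions concrete, here is a minimal sketch using a couple of the expressions above. The literal values are made up, and the exact signatures (for example, whether ST_GeomFromWKT takes an optional SRID argument, mirroring ST_GeomFromWKB) should be checked against the function reference:

    -- Illustrative conversions between interchange formats and native GEOMETRY
    SELECT
      ST_AsEWKT(ST_GeomFromWKT('POINT (-82.64 27.77)', 4326))      AS point_ewkt,    -- WKT in, extended WKT out
      ST_AsGeoJSON(ST_GeomFromWKT('LINESTRING (0 0, 1 1)', 4326))  AS line_geojson;  -- WKT in, GeoJSON out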

2. Spatially joining multiple datasets together

Spatial joins are among the most important and widely used operations in geospatial data processing. They allow you to combine attributes from different datasets and perform aggregations or data enrichment based on their spatial relationships, like containment, intersection, and proximity. This makes spatial joins essential for answering real-world questions like identifying which buildings fall within a flood zone, assigning census demographics to customer addresses, and analyzing connected vehicles within cell coverage areas. Because so much geospatial analysis depends on integrating multiple datasets, spatial joins are often a first step in exploratory spatial analysis, spatial modeling, and location-based decision-making.

Next, we will join the address and division tables using a spatial join. Anyone who has worked with address data sources knows that addresses can be messy (one common cause is that different countries use different addressing systems). Further, the address table does not include a full administrative hierarchy (i.e., no county information for US addresses). So we will use the division table to validate the city information and enrich it by adding county-equivalent information.

This data validation and enrichment process would be difficult to solve without a spatial join. To do this, we need to find the address inside the division. We will use ST_Contains to perform a point-in-polygon spatial join, letting Databricks handle the internals of the operation, no do-it-yourself spatial indexing required.
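A minimal version of that join might look like the following; the table and column names here are assumptions for the sketch, not the actual Overture schema:

    -- Point-in-polygon join: keep each address with the division polygon that contains it
    SELECT
      a.address_id,
      a.city   AS address_city,
      d.name   AS division_name,
      d.county AS county
    FROM addresses AS a
    JOIN divisions AS d
      ON ST_Contains(d.geom, a.geom);   -- true when the address point lies inside the division polygon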

Now we can more readily standardize to the correct city, state, county, and country, e.g., replace missing cities in addresses using those provided in the divisions table.
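Continuing the sketch above, the fill-in could be expressed with a simple COALESCE over the joined result (again, all names are illustrative):

    -- Prefer the address's own city, falling back to the containing division's name
    SELECT
      a.address_id,
      COALESCE(a.city, d.name) AS city,    -- fill missing city from the division
      d.county                 AS county,  -- add county-equivalent information
      a.region,
      a.country
    FROM addresses AS a
    JOIN divisions AS d
      ON ST_Contains(d.geom, a.geom);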

After validating the addresses, we followed a similar approach to join the addresses to buildings using ST_Intersects, enriching the buildings table with address information. For the US, this spatial join matched 44M addresses to buildings, with 55M buildings remaining unmatched. In the next example, let's see how we can use proximity to potentially identify buildings that did not match to an address.
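That enrichment join can be sketched as follows, with a LEFT JOIN so that unmatched buildings are kept (names are again illustrative):

    -- Attach address attributes to building footprints; unmatched buildings keep NULL address columns
    SELECT
      b.building_id,
      a.address_id,
      a.street,
      a.city
    FROM buildings AS b
    LEFT JOIN addresses AS a
      ON ST_Intersects(b.geom, a.geom);   -- address point falls on the building footprint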

3. Transforming data to specific spatial reference systems

Geospatial datasets are often created in different coordinate reference systems (CRS), such as latitude-longitude (WGS84) or projected systems like UTM, depending on their source and purpose. While each CRS defines how the earth's curved surface is represented on a flat map, using datasets with mismatched projections can cause features to misalign, distort distances, or produce incorrect spatial joins and measurements. A store located in a flood zone will not match in a spatial join if the two datasets use different coordinate systems. For accurate analysis, whether calculating areas, joining layers, or visualizing spatial relationships, it's essential to ensure all datasets are transformed into the same projection so they share a consistent spatial reference.

To identify addresses within proximity of the remaining 55M unmatched buildings in the US, let's project our WGS84 GEOMETRY data to Conus Albers (EPSG:5070) for North America, which gives us units in meters. This is accomplished with the ST_Transform function.
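A sketch of that projection step, with assumed table and column names:

    -- Reproject unmatched US buildings from WGS84 (EPSG:4326) to Conus Albers (EPSG:5070),
    -- whose units are meters, so downstream distance filters can be expressed in meters
    CREATE OR REPLACE TABLE buildings_unmatched_albers AS
    SELECT
      building_id,
      ST_Transform(geom, 5070) AS geom_5070
    FROM buildings_unmatched;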

Let's apply ST_DWithin between our unmatched U.S. buildings and addresses, using a distance within value of just 2 meters.
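Assuming both datasets have been reprojected to SRID 5070 as above, that query could look like the following (names are illustrative):

    -- Candidate matches: addresses within 2 meters of an unmatched building (SRID 5070, units in meters)
    SELECT
      b.building_id,
      a.address_id
    FROM buildings_unmatched_albers AS b
    JOIN addresses_albers AS a
      ON ST_DWithin(b.geom_5070, a.geom_5070, 2.0);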

The distance within value can be increased as needed to gather a set of potential address matches; also, a Recursive CTE can be useful to iterate over multiple distances. For this example, a filter polygon allows us to easily isolate our search to the vicinity of Saint Petersburg, Florida. The polygon is initially prepared from WKT using ST_GeomFromWKT, then transformed into SRID 5070 to match the address and building data.

To set up datasets for the recursive CTE, we apply a spatial filter over our data by intersecting buildings and addresses with the search polygon, showing buildings below (addresses are handled similarly).
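Here is a sketch of that setup, preparing the search polygon from WKT and filtering the buildings to it. The WKT literal is a placeholder, not the polygon used above, and the table names are assumed:

    -- Build the Saint Petersburg, FL search polygon in WGS84, then transform it to SRID 5070
    WITH search_area AS (
      SELECT ST_Transform(
               ST_GeomFromWKT(
                 'POLYGON ((-82.75 27.70, -82.60 27.70, -82.60 27.82, -82.75 27.82, -82.75 27.70))',
                 4326),
               5070) AS geom
    )
    -- Keep only the buildings that intersect the search polygon (addresses are filtered the same way)
    SELECT b.*
    FROM buildings_unmatched_albers AS b
    JOIN search_area AS s
      ON ST_Intersects(b.geom_5070, s.geom);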

The recursive CTE below iterates over buildings to identify address candidates within 5, 10, and 15 meters. The result table removes duplicate addresses over successive distances using the following window expression: QUALIFY RANK() OVER (PARTITION BY blg_id, addr_id ORDER BY dwithin) = 1.
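One way to express this iteration, shown as a hedged sketch rather than the exact query from this post (it assumes recursive CTE support, uses illustrative table names, and drives the recursion from the distance thresholds):

    -- Generate the distance thresholds 5, 10, 15 with a recursive CTE, then
    -- keep each building/address pair at the smallest distance that matches it
    WITH RECURSIVE distances (dwithin) AS (
      SELECT 5
      UNION ALL
      SELECT dwithin + 5 FROM distances WHERE dwithin < 15
    )
    SELECT
      b.building_id AS blg_id,
      a.address_id  AS addr_id,
      d.dwithin
    FROM distances AS d
    CROSS JOIN buildings_filtered AS b
    JOIN addresses_filtered AS a
      ON ST_DWithin(b.geom_5070, a.geom_5070, d.dwithin)
    QUALIFY RANK() OVER (PARTITION BY b.building_id, a.address_id ORDER BY d.dwithin) = 1;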

Here are the address candidates around one of the buildings, showing matches at 5m (blue), 10m (orange) and 15m (green). This is rendered using Databricks' built-in Marker Maps using clustering mode, which will fan out close points for easier viewing. When creating a Dashboard, we might also have used AI/BI Point Maps, which support cross-filtering and drill-through.

There are numerous alternate uses of Recursive CTEs for spatial processing, e.g., implementing Prim's algorithm for building a minimum spanning tree of your delivery points.

4. Measuring distances between locations

Proximity is a core concept in spatial analysis because distance often determines the strength or relevance of relationships between locations. Whether identifying the nearest hospital, analyzing competition between stores, or mapping commuter patterns, understanding how close features are to one another provides critical context.

Continuing with our example dataset, we performed the same Conus Albers transform operation on our cities in Florida to measure their distances. We are measuring from the geometric center of each city, generated using ST_Centroid.
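A sketch of that preparation step (the cities table and its columns are assumptions):

    -- Compute each city's geometric center, keeping a WGS84 copy and a Conus Albers copy
    CREATE OR REPLACE TABLE florida_city_centroids AS
    SELECT
      city_name,
      ST_Centroid(geom)                     AS centroid_4326,   -- WGS84, degrees
      ST_Transform(ST_Centroid(geom), 5070) AS centroid_5070    -- Conus Albers, meters
    FROM florida_cities;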

When calculating the distance between two GEOMETRIES, there are several different functions to consider:

  1. ST_Distance - Returns the Cartesian distance in the units of the provided GEOMETRIES, calculating the straight-line path based on their x and y coordinates, as if the Earth were flat.
  2. ST_DistanceSphere - Returns the spherical distance (always in meters) between two point GEOMETRIES, measured on the mean radius of the WGS84 ellipsoid of the Earth, with coordinate points assumed to have units of degrees, e.g., SRID 4326 would be valid but not SRID 5070.
  3. ST_DistanceSpheroid - Returns the geodesic distance (always in meters) between two point GEOMETRIES on the WGS84 ellipsoid of the Earth; also, with coordinate points assumed to have units of degrees. Again, SRID 4326, or any other SRID with units in degrees, would be valid inputs for this example.

The various distance calculations are applied between city1 and city2 with ST_Distance used for the GEOMETRIES in 5070, then ST_DistanceSphere and ST_DistanceSpheroid for the GEOMETRIES in 4326.
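Putting that together, a comparison query might look like this sketch (table and column names carried over from the hypothetical centroid table above):

    -- Compare the three distance functions for every pair of Florida cities
    SELECT
      c1.city_name AS city1,
      c2.city_name AS city2,
      ST_Distance(c1.centroid_5070, c2.centroid_5070)           AS cartesian_m,  -- planar, SRID 5070 (meters)
      ST_DistanceSphere(c1.centroid_4326, c2.centroid_4326)     AS sphere_m,     -- spherical, SRID 4326 (degrees in, meters out)
      ST_DistanceSpheroid(c1.centroid_4326, c2.centroid_4326)   AS spheroid_m    -- geodesic, SRID 4326 (degrees in, meters out)
    FROM florida_city_centroids AS c1
    JOIN florida_city_centroids AS c2
      ON c1.city_name < c2.city_name;   -- each unordered pair once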

Among the distance functions, we would anticipate ST_DistanceSpheroid (measurements based on Earth's ellipsoidal shape) to be the most accurate in this instance, followed by ST_DistanceSphere (measurements assume Earth is a perfect sphere). Note that ST_Distance is most useful when working in projected coordinate reference systems or when the Earth's curvature can otherwise be ignored. Even though SRID 5070 is in meters, we can see the effects of Cartesian calculations over larger distances. ST_Distance would not generally be a suitable choice for SRID 4326, given that the distance covered by one degree of longitude changes drastically as you move from the equator toward the poles; for example, the length of one degree of longitude varies by up to 6 km just within the state of Florida.

Proximity by distance between Florida cities, visualized in Databricks Notebook with kepler.gl.

Spatial SQL includes 80+ functions, allowing customers to perform common spatial data operations with simplicity and scale. Powered by Spatial SQL, customers are now starting to shift their approach to managing and integrating with GIS systems:

"Spatial data is at the core of everything we do at OSPRI, be it Livestock Traceability, Disease Eradication, or Pest Management. Databricks Spatial SQL allows us to fully integrate Databricks with all our work. These advancements allow us to shift large desktop-based spatial modelling tasks to a platform where they're closer to the data and can be run in parallel, at speed. Weeks of iterations across operational boundaries can be comfortably run within a day or two, reducing our time to decision. These new functions also allow us to make Databricks the integration layer between our transactional systems and our GIS platform, ensuring it can be informed by data across the organisation without compromise." - Campbell Fleury, Manager, Data and Information Products, OSPRI New Zealand

What's next

There is so much you can do with Spatial SQL in Databricks, with more to come, including new expressions and faster spatial joins. If you would like to share what additional ST expressions you require, please fill out this short survey.
