Linked server to Synapse Spark tables: Poor performance. Any best practices? (Aside from don't use linked servers)

104 Views Asked by At

Old Linked Server Setup:

  • Linked server from SQL Server DB, to another SQL Server DB: Worked fine for its purposes.

New Linked Server Setup:

  • Linked server from SQL Server DB, to 2 Synapse DBs:
  • DB1 - A DB that's Views which point to CSV files in the Datalake: Performance is good. Possibly better than the 'old setup'
  • DB2 - A DB that's External Tables, that point to Delta Files in the Datalake: Performance is bad, and it's easy to write a query that hangs.

Edit: I'm useing 4 part names. For some simple queries, using openquery can return better results. But often it still hangs with openquery as well - We also need to pass parameters so openquery isn't an overall solution.

So it's just queries across the linked server, when hitting those Spark, External-Tables-to-Delta, that works poorly. e.g.

Use the varchar type with some UTF8 collation if you're reading data from Parquet, Azure Cosmos DB, Delta Lake, or CSV with UTF-8 encoding.

  • Doing simple WHERE clasues on varchar/nvarchar columns can cause performance to hang. e.g. WHERE left(varcharcolumn,1) = 'v', causes a query across the linked server, to go from 10 seconds to > 20 minutes, and presumably never would complete.

Linked Servers aren't the best, but I'm wondering if there's something I'm missing, that's making queries across the linked server to those Synapse Spark Tables so terrible.

Things I've tried:

  • Switching from varchar to nvarchar when hitting a hang (which was a huge help, as stated above...for some reason)
  • Being super specific on the length of varchar / nvarchar
  • casting the varchar / nvarchar columns in my query
  • Blindly changing Linked Server settings that shouldn't matter in this case

Example query

select
dp.visionwbs1

from LinkedServer.Db.dbo.dimproject dp

left join  LinkedServer.Db.dbo.dimclient dc
on dc.pk_clientid = dp.fk_clientid

--Runs in :04

But say this one:

select
dp.visionwbs1

from LinkedServer.Db.dbo.dimproject dp

left join  LinkedServer.Db.dbo.dimclient dc
on dc.pk_clientid = dp.fk_clientid

where left(dp.visionwbs1,1) = 'p'

-stalls after returning 141 rows, never to complete

The same queries if ran on a linked server, against a regular SQL server (the same tables are still available on our pre-Synapse database) - complete in :01 and < :01 respectively.

The above simple example, say - could run well if using OPENQUERY. Often that doesn't help though, and is a moot point, as we need to pass parameters.

pastbin link of the live query statistics from the above one https://pastebin.com/6ZtB65Tf

0

There are 0 best solutions below