Old Linked Server Setup:
- Linked server from SQL Server DB, to another SQL Server DB: Worked fine for its purposes.
New Linked Server Setup:
- Linked server from SQL Server DB, to 2 Synapse DBs:
- DB1 - A DB that's Views which point to CSV files in the Datalake: Performance is good. Possibly better than the 'old setup'
- DB2 - A DB that's External Tables, that point to Delta Files in the Datalake: Performance is bad, and it's easy to write a query that hangs.
Edit: I'm useing 4 part names. For some simple queries, using openquery can return better results. But often it still hangs with openquery as well - We also need to pass parameters so openquery isn't an overall solution.
So it's just queries across the linked server, when hitting those Spark, External-Tables-to-Delta, that works poorly. e.g.
- Doing a join across the linked server, on two varchar columns = complete hang. Changing datatypes in External Tables to nvarchar fixed this,despite being against best practices per Microsoft
Use the varchar type with some UTF8 collation if you're reading data from Parquet, Azure Cosmos DB, Delta Lake, or CSV with UTF-8 encoding.
- Doing simple WHERE clasues on varchar/nvarchar columns can cause performance to hang. e.g. WHERE left(varcharcolumn,1) = 'v', causes a query across the linked server, to go from 10 seconds to > 20 minutes, and presumably never would complete.
Linked Servers aren't the best, but I'm wondering if there's something I'm missing, that's making queries across the linked server to those Synapse Spark Tables so terrible.
Things I've tried:
- Switching from varchar to nvarchar when hitting a hang (which was a huge help, as stated above...for some reason)
- Being super specific on the length of varchar / nvarchar
- casting the varchar / nvarchar columns in my query
- Blindly changing Linked Server settings that shouldn't matter in this case
Example query
select
dp.visionwbs1
from LinkedServer.Db.dbo.dimproject dp
left join LinkedServer.Db.dbo.dimclient dc
on dc.pk_clientid = dp.fk_clientid
--Runs in :04
But say this one:
select
dp.visionwbs1
from LinkedServer.Db.dbo.dimproject dp
left join LinkedServer.Db.dbo.dimclient dc
on dc.pk_clientid = dp.fk_clientid
where left(dp.visionwbs1,1) = 'p'
-stalls after returning 141 rows, never to complete
The same queries if ran on a linked server, against a regular SQL server (the same tables are still available on our pre-Synapse database) - complete in :01 and < :01 respectively.
The above simple example, say - could run well if using OPENQUERY. Often that doesn't help though, and is a moot point, as we need to pass parameters.
pastbin link of the live query statistics from the above one https://pastebin.com/6ZtB65Tf