SQL: Find missing hierarchy Folders (Paths) in a table

316 Views Asked by At

I have a table which contains Folders Paths. I need to find all the "gaps" between those folders in the hierarchy. I mean that, if the table contains these 3 folders:

'A'
'A\B\C'
'A\B\C\D\E\F\G'

I need to find the following missing folders in the hierarchy:

'A\B'
'A\B\C\D'
'A\B\C\D\E'
'A\B\C\D\E\F'

This table contains more than 250,000 records of folders, so we seek for the most efficient way to do so, otherwise the script will be stuck for long time, time we don't have.

Comment: I don't have list of all folders. What I have are the "root" folders and the "leafs" folders which I need to find the "gaps" between them in the hierarchy.

Second comment: The table can contains more than one hierarchy and we need to find the "gaps" in all of the hierarchies. For that matter, There are 2 another int columns: "DirID" and "BaseDirID". The "DirID" column is the id column in our table. The "BaseDirID" contains the id of the first folder in the hierarchy. So all the folders (paths) from the same hierarchy share the same value in this column. Sample data for example:

Example sample data

DirID   BaseDirID   DisplayPath
1   1   'A'
2   1   'A\B\C'
3   1   'A\B\C\D\E'
4   4   'U'
5   4   'U\V\W'
6   4   'U\V\W\X\Y'

So we need to find the following data:

Expected Results

BaseDirID   DisplayPath
1   'A\B'
1   'A\B\C\D'
4   'U\V'
4   'U\V\W\X'

Thanks in advance.

1

There are 1 best solutions below

9
On BEST ANSWER

Here is one approach using Recursive CTE and split string function

;WITH existing_hierachies
     AS (SELECT DirID,
                BaseDirID,
                DisplayPath
         FROM   (VALUES (1,1,'A' ),
                        (2,1,'A\B\C' ),
                        (3,1,'A\B\C\D\E' ),
                        (4,4,'U' ),
                        (5,4,'U\V\W' ),
                        (6,4,'U\V\W\X\Y' )) tc (DirID, BaseDirID, DisplayPath) ),
     folders_list
     AS (SELECT ItemNumber,
                item fol,
                BaseDirID
         FROM   (SELECT row_number()over(partition by BaseDirID order by Len(DisplayPath) DESC)rn,*
                 FROM   existing_hierachies) a
                 CROSS apply dbo.[Delimitedsplit8k](DisplayPath, '\')
                 Where Rn = 1),
     rec_cte
     AS (SELECT *,
                Cast(fol AS VARCHAR(4000))AS hierar
         FROM   folders_list
         WHERE  ItemNumber = 1
         UNION ALL
         SELECT d.*,
                Cast(rc.hierar + '\' + d.fol AS VARCHAR(4000))
         FROM   rec_cte rc
                JOIN folders_list d
                  ON rc.BaseDirID = d.BaseDirID
                  AND d.ItemNumber = rc.ItemNumber + 1)
SELECT rc.BaseDirID,
       rc.hierar AS Missing_Hierarchies
FROM   rec_cte rc
WHERE  NOT EXISTS (SELECT 1
                   FROM   existing_hierachies eh 
                   WHERE  eh.BaseDirID = rc.BaseDirID 
                     AND  eh.DisplayPath = rc.hierar) 
Order by rc.BaseDirID

Result :

+-----------+---------------------+
| BaseDirID | Missing_Hierarchies |
+-----------+---------------------+
|         1 | A\B                 |
|         1 | A\B\C\D             |
|         4 | U\V                 |
|         4 | U\V\W\X             |
+-----------+---------------------+

Split string function code

CREATE FUNCTION [dbo].[DelimitedSplit8K]
        (@pString VARCHAR(8000), @pDelimiter CHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
 RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
     -- enough to cover NVARCHAR(4000)
  WITH E1(N) AS (
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                ),                          --10E+1 or 10 rows
       E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
       E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
 cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
                     -- for both a performance gain and prevention of accidental "overruns"
                 SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
                ),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
                 SELECT 1 UNION ALL
                 SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
                ),
cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
                 SELECT s.N1,
                        ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000)
                   FROM cteStart s
                )
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
 SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
        Item       = SUBSTRING(@pString, l.N1, l.L1)
   FROM cteLen l
;
GO

Referred from http://www.sqlservercentral.com/articles/Tally+Table/72993/