Use an initial query to merge queries across multiple databases?

111 Views Asked by At

Using the Data Explorer (SEDE), I would like to find which users have more than 200000 reputation on Stack Overflow, and then find details for any accounts they have on other Stack Exchange sites.

Here is the query which provides the list with this threshold:

Select id, reputation, accountid
From users
Where reputation > 200000

AccountId is the key for all Stack Exchange sites.

I have found this query for aggregating across SEDE databases, but how is it possible to do that based on the dynamic results of the previous/baseline query?

Here is the kind of output I'm aiming for:

id_so, reputation_so, accounted, other_stackexchange_site_name, reputation_othersite, number_of_answers_other_site, number_of_questions_other_site
1, 250000, 23, serverfault, 500, 5, 1
1, 250000, 23, superuser, 120, 1, 0
2, 300000, 21, serverfault, 300, 3, 2
2, 300000, 21, webmasters, 230, 1, 1
3, 350000, 20, NA, NA, NA, NA
#the case with id 3 has an SO profile with reputation but it has no other profile in other Stack Exchange site
1

There are 1 best solutions below

0
On BEST ANSWER

To run non-trivial queries across databases, based on an initial query:

  1. Figure out the common key in all databases. In this case it's AccountId (which is a user's Stack-Exchange-wide Id).
  2. Create your initial query to feed that key into a temp table. In this case:

    CREATE TABLE #UsersOfInterest (AccountId INT)
    INSERT INTO  #UsersOfInterest  
        SELECT  u.AccountId
        FROM    Users u
        Where   u.Reputation > 200000
    
  3. Create Another temp table to hold the final results (see below).
  4. Determine the query, to run on each site, that gets the info you want. EG:

    SELECT  u.AccountId, u.DisplayName, u.Reputation, u.Id
            , numQst = (SELECT COUNT(q.Id)  FROM Posts q  WHERE q.OwnerUserId = u.Id  AND q.PostTypeId = 1)
            , numAns = (SELECT COUNT(q.Id)  FROM Posts q  WHERE q.OwnerUserId = u.Id  AND q.PostTypeId = 2)
    FROM    Users u
    WHERE   u.AccountId = ##seAccntId##
    
  5. Use a system query to get the appropriate databases. For the Data Explorer (SEDE), a query of this type:

    SELECT      name
    FROM        sys.databases
    WHERE       CASE    WHEN state_desc = 'ONLINE'
                        THEN OBJECT_ID (QUOTENAME (name) + '.[dbo].[PostNotices]', 'U')
                END
                IS NOT NULL
    
  6. Create a cursor on the above query and use it to step through the databases.
    For each database:

    1. Build a query string that takes the query of step 4 and puts it into the temp table of step 3.
    2. Run the query string using sp_executesql.
  7. When the cursor is done, perform the final query on the temp table from step 3.


Refer to this other answer, for a working template for querying all of the Stack Exchange sites.

Putting it all together, results in the following query, which you can run live on SEDE:

-- MinMasterSiteRep: User's must have this much rep on whichever site this query is run against
-- MinRep: User's must have this much rep on all other sites

CREATE TABLE #UsersOfInterest (
    AccountId       INT NOT NULL
    , Reputation    INT
    , UserId        INT
    , PRIMARY KEY (AccountId)
)
INSERT INTO  #UsersOfInterest
    SELECT  u.AccountId, u.Reputation, u.Id
    FROM    Users u
    Where   u.Reputation > ##MinMasterSiteRep:INT?200000##

CREATE TABLE #AllSiteResults (
      [Master Rep]          INT
      , [Mstr UsrId]        NVARCHAR(777)
      , AccountId           NVARCHAR(777)
      , [Site name]         NVARCHAR(777)
      , [Username on site]  NVARCHAR(777)
      , [Rep]               INT
      , [# Ans]             INT
      , [# Qst]             INT
)

DECLARE @seDbName       AS NVARCHAR(777)
DECLARE @seSiteURL      AS NVARCHAR(777)
DECLARE @sitePrettyName AS NVARCHAR(777)
DECLARE @seSiteQuery    AS NVARCHAR(max)

DECLARE seSites_crsr CURSOR FOR
WITH dbsAndDomainNames AS (
    SELECT      dbL.dbName
                , STRING_AGG (dbL.domainPieces, '.')    AS siteDomain
    FROM (
        SELECT      TOP 50000   -- Never be that many sites and TOP is needed for order by, below
                    name        AS dbName
                    , value     AS domainPieces
                    , row_number ()  OVER (ORDER BY (SELECT 0)) AS [rowN]
        FROM        sys.databases
        CROSS APPLY STRING_SPLIT (name, '.')
        WHERE       CASE    WHEN state_desc = 'ONLINE'
                            THEN OBJECT_ID (QUOTENAME (name) + '.[dbo].[PostNotices]', 'U') -- Pick a table unique to SE data
                    END
                    IS NOT NULL
        ORDER BY    dbName, [rowN] DESC
    ) AS dbL
    GROUP BY    dbL.dbName
)
SELECT      REPLACE (REPLACE (dadn.dbName, 'StackExchange.', ''), '.', ' ' )  AS [Site Name]
            , dadn.dbName
            , CASE  -- See https://meta.stackexchange.com/q/215071
                    WHEN dadn.dbName = 'StackExchange.Mathoverflow.Meta'
                    THEN 'https://meta.mathoverflow.net/'
                    -- Some AVP/Audio/Video/Sound kerfuffle?
                    WHEN dadn.dbName = 'StackExchange.Audio'
                    THEN 'https://video.stackexchange.com/'
                    -- Ditto
                    WHEN dadn.dbName = 'StackExchange.Audio.Meta'
                    THEN 'https://video.meta.stackexchange.com/'
                    -- Normal site
                    ELSE 'https://' + LOWER (siteDomain) + '.com/'
            END AS siteURL
FROM        dbsAndDomainNames dadn
WHERE       (dadn.dbName = 'StackExchange.Meta'  OR  dadn.dbName NOT LIKE '%Meta%')

-- Step through cursor
OPEN    seSites_crsr
FETCH   NEXT FROM seSites_crsr INTO @sitePrettyName, @seDbName, @seSiteURL
WHILE   @@FETCH_STATUS = 0
BEGIN
    SET @seSiteQuery = '
        USE [' + @seDbName + ']

        INSERT INTO #AllSiteResults
            SELECT
                        uoi.Reputation                                                                                  AS [Master Rep]
                        , ''site://u/'' + CAST(uoi.UserId AS NVARCHAR(88)) + ''|'' + CAST(uoi.UserId AS NVARCHAR(88))   AS [Mstr UsrId]
                        , [AccountId] = ''https://stackexchange.com/users/'' + CAST(u.AccountId AS NVARCHAR(88)) + ''?tab=accounts|'' + CAST(u.AccountId AS NVARCHAR(88))
                        , ''' + @sitePrettyName + '''                                                                   AS [Site name]
                        , ''' + @seSiteURL + ''' + ''u/'' + CAST(u.Id AS NVARCHAR(88)) + ''|'' + u.DisplayName          AS [Username on site]
                        , u.Reputation                                                                                  AS [Rep]
                        , (SELECT COUNT(q.Id)  FROM Posts q  WHERE q.OwnerUserId = u.Id  AND q.PostTypeId = 2)          AS [# Ans]
                        , (SELECT COUNT(q.Id)  FROM Posts q  WHERE q.OwnerUserId = u.Id  AND q.PostTypeId = 1)          AS [# Qst]
            FROM        #UsersOfInterest uoi
            INNER JOIN  Users u                ON uoi.AccountId = u.AccountId
            WHERE       u.Reputation > ##MinRep:INT?200##
    '
    EXEC sp_executesql @seSiteQuery

    FETCH NEXT FROM seSites_crsr INTO @sitePrettyName, @seDbName, @seSiteURL
END
CLOSE       seSites_crsr
DEALLOCATE  seSites_crsr

SELECT      *
FROM        #AllSiteResults
ORDER BY    [Master Rep] DESC, AccountId, [Rep] DESC

It gives results like:

Query output

-- where the blue values are hyperlinked.


Note that a user must have 200 rep on a site for it to be "significant". That's also the rep needed for the site to be included in the Stack Exchange flair.