SQL Find Reciprocal Relationship

1.5k Views Asked by At

I am trying to find a situation using the Stack Exchange Data Explorer (SEDE) where two distinct users on Stack Overflow have accepted an answer from each other. So for example:

Post A { Id: 1, OwnerUserId: "user1", AcceptedAnswerId: "user2" }

and

Post B { Id: 2, OwnerUserId: "user2", AcceptedAnswerId: "user1" }

I currently have a query that can find two users that have collaborated on more than question as questioner-answerer but it does not determine if that relationship is reciprocal:

SELECT user1.Id AS User_1, user2.Id AS User_2
FROM Posts p
INNER JOIN Users user1 ON p.OwnerUserId = user1.Id
INNER JOIN Posts p2 ON p.AcceptedAnswerId = p2.Id
INNER JOIN Users user2 ON p2.OwnerUserId = user2.Id
WHERE p.OwnerUserId <> p2.OwnerUserId
AND p.OwnerUserId IS NOT NULL
AND p2.OwnerUserId IS NOT NULL
AND user1.Id <> user2.Id
GROUP BY user1.Id, user2.Id HAVING COUNT(*) > 1;

For anyone unfamiliar with the schema, there are two tables like so:

Posts
--------------------------------------
Id                      int
PostTypeId              tinyint
AcceptedAnswerId        int
ParentId                int
CreationDate            datetime
DeletionDate            datetime
Score                   int
ViewCount               int
Body                    nvarchar (max)
OwnerUserId             int
OwnerDisplayName        nvarchar (40)
LastEditorUserId        int
LastEditorDisplayName   nvarchar (40)
LastEditDate            datetime
LastActivityDate        datetime
Title                   nvarchar (250)
Tags                    nvarchar (250)
AnswerCount             int
CommentCount            int
FavoriteCount           int
ClosedDate              datetime
CommunityOwnedDate      datetime

And

Users
--------------------------------------
Id                      int
Reputation              int
CreationDate            datetime
DisplayName             nvarchar (40)
LastAccessDate          datetime
WebsiteUrl              nvarchar (200)
Location                nvarchar (100)
AboutMe                 nvarchar (max)
Views                   int
UpVotes                 int
DownVotes               int
ProfileImageUrl         nvarchar (200)
EmailHash               varchar (32)
AccountId               int
5

There are 5 best solutions below

0
On BEST ANSWER

One CTE and simple inner joins will do the job. There's no need for so much code as I've observed in other answers. Note a lot of comments in mine.

Link to StackExchange Data Explorer with sample result saved

with questions as ( -- this is needed so that we have ids of users asking and answering
select
   p1.owneruserid as question_userid
 , p2.owneruserid as answer_userid
 --, p1.id -- to view sample ids
from posts p1
inner join posts p2 on -- to fetch answer post
  p1.acceptedanswerid = p2.id
)
select distinct -- unique pairs
    q1.question_userid as userid1
  , q1.answer_userid as userid2
  --, q1.id, q2.id -- to view sample ids
from questions q1
inner join questions q2 on
      q1.question_userid = q2.answer_userid -- accepted answer from someone
  and q1.answer_userid = q2.question_userid -- who also accepted our answer
  and q1.question_userid <> q1.answer_userid -- and we aren't self-accepting

This brings as an example posts:

Though, StackExchange may throw you timeout because of large dataset and the distinct part. If you wish to view some data, remove distinct and add top N at start:

with questions as (
...
)
select top 3 ...
0
On

ETA: Oops. Misread the question; Op wants Accepted answers and below is for any reciprocal answers. (It's easy to modify but I'm more interested in the latter anyway.)


Because of the very large dataset (and the need to not time-out SEDE), I chose to restrict the sets AMAP and build from there.

So this query:

  1. Only returns any rows if there is a reciprocal relationship.
  2. Returns all such Q&A pairs.
  3. Excludes self answers.
  4. Leverages SEDE's query parameters and magic columns for usability.

See it live in SEDE.

-- UserA: Enter ID of user A
-- UserB: Enter ID of user B
WITH possibleAnswers AS (
    SELECT
                a.Id                AS aId
                , a.ParentId        AS qId
                , a.OwnerUserId   
                , a.CreationDate
    FROM        Posts a
    WHERE       a.PostTypeId        = 2  --  answers
    AND         a.OwnerUserId       IN (##UserA:INT##, ##UserB:INT##)
),
possibleQuestions AS (
    SELECT
                q.Id                AS qId
                , q.OwnerUserId   
                , q.Tags
    FROM        Posts q
    INNER JOIN  possibleAnswers pa  ON q.Id = pa.qId
    WHERE       q.PostTypeId        = 1  --  questions
    AND         q.OwnerUserId       IN (##UserA:INT##, ##UserB:INT##)
    AND         q.OwnerUserId       != pa.OwnerUserId  --  No self answers
)
SELECT 
            pa.OwnerUserId          AS [User Link]
            , 'answers'             AS [Action]
            , pq.OwnerUserId        AS [User Link]
            , pa.CreationDate       AS [at]
            , pq.qId                AS [Post Link]
            , pq.Tags
FROM        possibleQuestions pq
INNER JOIN  possibleAnswers pa      ON pq.qId = pa.qId
WHERE       pq.OwnerUserId          =  ##UserB:INT##
AND         EXISTS (SELECT * FROM possibleQuestions pq2  WHERE pq2.OwnerUserId =  ##UserA:INT##)

UNION ALL SELECT 
            pa.OwnerUserId          AS [User Link]
            , 'answers'             AS [Action]
            , pq.OwnerUserId        AS [User Link]
            , pa.CreationDate       AS [at]
            , pq.qId                AS [Post Link]
            , pq.Tags
FROM        possibleQuestions pq
INNER JOIN  possibleAnswers pa      ON pq.qId = pa.qId
WHERE       pq.OwnerUserId          =  ##UserA:INT##
AND         EXISTS (SELECT * FROM possibleQuestions pq2  WHERE pq2.OwnerUserId =  ##UserB:INT##)

ORDER BY    pa.CreationDate

It produces results like (Click for larger view):

results


For a list of all such user pairs see this SEDE query.

0
On

Here's how I'd go about it. Here's some simplified data:

if object_id('tempdb.dbo.#Posts') is not null drop table #Posts
create table #Posts
(
    PostId char(1),
    OwnerUserId int,
    AcceptedAnswerUserId int
)

insert into #Posts
values
('A', 1, 2),
('B', 2, 1),
('C', 2, 3),
('D', 2, 4),
('E', 3, 1),
('F', 4, 1)

For our purposes, we don't really care about the PostId, and what we have as a starting point is a set of ordered pairs of post owners (OwnerUserId) and accepted answerers (AcceptedAnswerUserId).

(Although not necessary, you can visualize the set like so)

select distinct OwnerUserId, AcceptedAnswerUserId
from #Posts

Now we want to find all the entries in this set which have those two fields reversed. I.e. where the owner if one post is the accepted answerer of the other. So where a pair is (1, 2), we want to find (2, 1).

I did this with a left join so you could see the rows it's omitting, but changing it to an inner join will restrict it to the set you've described. You can harvest the information however you please (either by picking any of the columns out of a hat, or if you want them on a single row, returning both columns from exactly one of the tables).

select 
    u1.OwnerUserId, 
    u1.AcceptedAnswerUserId, 
    u2.OwnerUserId, 
    u2.AcceptedAnswerUserId
from #Posts u1
left outer join #Posts u2
    on u1.AcceptedAnswerUserId = u2.OwnerUserId
        and u1.OwnerUserId = u2.AcceptedAnswerUserId

Edit If you want to exclude self answers, just add and u1.AcceptedAnswerUserId != u1.OwnerUserId to the on clause.

On a personal note, I've always found it funny how deeply rooted SQL and Relational Algebra are in set theory, and yet doing set based operations like this in SQL tend to feel very clunky. Mostly because to preserve the lack of order, you have to represent set members in a single column. But then to compare set members in SQL, you sort of need to represent the set members as separate columns.

Now ponder this, how might you extend this to triads of users commenting on the same post?

0
On

Using the technique from Salman A's answer, improved the sorting and added some more useful columns.

In conjunction with the queries in my other answer, it shows some interesting relationships.

See it in SEDE.

WITH QandA_users AS (
    SELECT      q.OwnerUserId   AS userQ
                , a.OwnerUserId AS userA
    FROM        Posts q
    INNER JOIN  Posts a         ON q.AcceptedAnswerId = a.Id
    WHERE       q.PostTypeId    = 1
),
pairsUnion (user1, user2, whoAnswered) AS (
    SELECT  userQ, userA, 'usr 2 answered'
    FROM    QandA_users
    WHERE   userQ <> userA
    UNION ALL
    SELECT  userA, userQ, 'usr 1 answered'
    FROM    QandA_users
    WHERE   userQ <> userA
),
collaborators AS (
    SELECT      user1, user2, COUNT(*) AS [Reciprocations]
    FROM        pairsUnion
    GROUP BY    user1, user2
    HAVING COUNT (DISTINCT whoAnswered) > 1
)
SELECT
            'site://u/' + CAST(c.user1 AS NVARCHAR) + '|Usr ' + u1.DisplayName      AS [User 1]
            , 'site://u/' + CAST(c.user2 AS NVARCHAR) + '|Usr ' + u2.DisplayName    AS [User 2]
            , c.Reciprocations                                                      AS [Reciprocal Accptd posts]
            , (SELECT COUNT(*)  FROM QandA_users qau  WHERE qau.userQ = c.user1)    AS [Usr 1 Qstns wt Accptd]
            , (SELECT COUNT(*)  FROM QandA_users qau  WHERE qau.userQ = c.user1  AND qau.userA = c.user2) AS [Accptd Ansr by Usr 2]
            , (SELECT COUNT(*)  FROM QandA_users qau  WHERE qau.userA = c.user2)    AS [Usr 2 Ttl Accptd Answrs]
FROM        collaborators c
INNER JOIN  Users u1        ON u1.Id = c.user1
INNER JOIN  Users u2        ON u2.Id = c.user2
ORDER BY    c.Reciprocations DESC
            , u1.DisplayName
            , u2.DisplayName

Results like:

results

0
On

The query in its simplest form (so that it does not timeout querying 16M questions) would be:

WITH accepter_acceptee(a, b) AS (
    SELECT q.OwnerUserId, a.OwnerUserId
    FROM Posts AS q
    INNER JOIN Posts AS a ON q.AcceptedAnswerId = a.Id
    WHERE q.PostTypeId = 1 AND q.OwnerUserId <> a.OwnerUserId
), collaborations(a, b, type) AS (
    SELECT a, b, 'a accepter b' FROM accepter_acceptee
    UNION ALL
    SELECT b, a, 'a acceptee b' FROM accepter_acceptee
)
SELECT a, b, COUNT(*) AS [collaboration count]
FROM collaborations
GROUP BY a, b
HAVING COUNT(DISTINCT type) = 2
ORDER BY a, b

Result: