How can I count rows in a 1:N:N relation in a faster way?

This question is a bit complicated to me, and I can't explain it in one sentence so the title may seem quite ambiguous.

I have 3 tables in my MySQL database, their structure is shown below:

word_list (5 million rows)





    +-----+--------+

    | wid | word   |

    +-----+--------+

    |   1 | foo    |

    |   2 | bar    |

    |   3 | hello  |

    +-----+--------+

paper_word_relation (10 million rows)





    +-----+-------+

    | pid | word  | 

    +-----+-------+

    |   1 |    1  | 

    |   1 |    2  | 

    |   1 |    3  | 

    |   2 |    1  | 

    |   2 |    3  | 

    +-----+-------+

paper_citation_relation (80K rows)





    +----------+--------+

    | pid_from | pid_to | 

    +----------+--------+

    |        1 |     2  | 

    |        1 |     3  | 

    |        1 |     4  |

    |        2 |     1  |

    |        2 |     3  |

    +----------+--------+

I want to find out how many papers contain word W, and cite the papers also contain word W.(for each word in the list)

I use two inner join to do this job but it seems extremely slow when the word is popular - above 50s (quite fast if the word is rarely used - below 0.1s), here is my code





    SELECT COUNT(*) FROM (

    SELECT a.pid_from, a.pid_to, b.word FROM paper_citation_relation AS a 

    INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid

    INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid

    WHERE b.word = 2 AND c.word = 2) AS d

How can I do this faster? Is my query not efficient enough or it's the problem about the amount of data?

I can only come up with one solution that I delete the words which occur less than 2 in the paper_word_relation table. (About 4 million words only occur once)

Thanks!

asked Nov 20 '18 at 6:23

TomLeung

387

Which version of MySQL? What indexes do you have on your tables?

– oysteing
Nov 20 '18 at 23:07

add a comment |

This question is a bit complicated to me, and I can't explain it in one sentence so the title may seem quite ambiguous.

I have 3 tables in my MySQL database, their structure is shown below:

word_list (5 million rows)





    +-----+--------+

    | wid | word   |

    +-----+--------+

    |   1 | foo    |

    |   2 | bar    |

    |   3 | hello  |

    +-----+--------+

paper_word_relation (10 million rows)





    +-----+-------+

    | pid | word  | 

    +-----+-------+

    |   1 |    1  | 

    |   1 |    2  | 

    |   1 |    3  | 

    |   2 |    1  | 

    |   2 |    3  | 

    +-----+-------+

paper_citation_relation (80K rows)





    +----------+--------+

    | pid_from | pid_to | 

    +----------+--------+

    |        1 |     2  | 

    |        1 |     3  | 

    |        1 |     4  |

    |        2 |     1  |

    |        2 |     3  |

    +----------+--------+

I want to find out how many papers contain word W, and cite the papers also contain word W.(for each word in the list)

I use two inner join to do this job but it seems extremely slow when the word is popular - above 50s (quite fast if the word is rarely used - below 0.1s), here is my code





    SELECT COUNT(*) FROM (

    SELECT a.pid_from, a.pid_to, b.word FROM paper_citation_relation AS a 

    INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid

    INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid

    WHERE b.word = 2 AND c.word = 2) AS d

How can I do this faster? Is my query not efficient enough or it's the problem about the amount of data?

I can only come up with one solution that I delete the words which occur less than 2 in the paper_word_relation table. (About 4 million words only occur once)

Thanks!

asked Nov 20 '18 at 6:23

TomLeung

387

Which version of MySQL? What indexes do you have on your tables?

– oysteing
Nov 20 '18 at 23:07

add a comment |

This question is a bit complicated to me, and I can't explain it in one sentence so the title may seem quite ambiguous.

I have 3 tables in my MySQL database, their structure is shown below:

word_list (5 million rows)





    +-----+--------+

    | wid | word   |

    +-----+--------+

    |   1 | foo    |

    |   2 | bar    |

    |   3 | hello  |

    +-----+--------+

paper_word_relation (10 million rows)





    +-----+-------+

    | pid | word  | 

    +-----+-------+

    |   1 |    1  | 

    |   1 |    2  | 

    |   1 |    3  | 

    |   2 |    1  | 

    |   2 |    3  | 

    +-----+-------+

paper_citation_relation (80K rows)





    +----------+--------+

    | pid_from | pid_to | 

    +----------+--------+

    |        1 |     2  | 

    |        1 |     3  | 

    |        1 |     4  |

    |        2 |     1  |

    |        2 |     3  |

    +----------+--------+

I want to find out how many papers contain word W, and cite the papers also contain word W.(for each word in the list)

I use two inner join to do this job but it seems extremely slow when the word is popular - above 50s (quite fast if the word is rarely used - below 0.1s), here is my code





    SELECT COUNT(*) FROM (

    SELECT a.pid_from, a.pid_to, b.word FROM paper_citation_relation AS a 

    INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid

    INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid

    WHERE b.word = 2 AND c.word = 2) AS d

How can I do this faster? Is my query not efficient enough or it's the problem about the amount of data?

I can only come up with one solution that I delete the words which occur less than 2 in the paper_word_relation table. (About 4 million words only occur once)

Thanks!

asked Nov 20 '18 at 6:23

TomLeung

387

This question is a bit complicated to me, and I can't explain it in one sentence so the title may seem quite ambiguous.

I have 3 tables in my MySQL database, their structure is shown below:

word_list (5 million rows)





    +-----+--------+

    | wid | word   |

    +-----+--------+

    |   1 | foo    |

    |   2 | bar    |

    |   3 | hello  |

    +-----+--------+

paper_word_relation (10 million rows)





    +-----+-------+

    | pid | word  | 

    +-----+-------+

    |   1 |    1  | 

    |   1 |    2  | 

    |   1 |    3  | 

    |   2 |    1  | 

    |   2 |    3  | 

    +-----+-------+

paper_citation_relation (80K rows)





    +----------+--------+

    | pid_from | pid_to | 

    +----------+--------+

    |        1 |     2  | 

    |        1 |     3  | 

    |        1 |     4  |

    |        2 |     1  |

    |        2 |     3  |

    +----------+--------+

I want to find out how many papers contain word W, and cite the papers also contain word W.(for each word in the list)

I use two inner join to do this job but it seems extremely slow when the word is popular - above 50s (quite fast if the word is rarely used - below 0.1s), here is my code





    SELECT COUNT(*) FROM (

    SELECT a.pid_from, a.pid_to, b.word FROM paper_citation_relation AS a 

    INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid

    INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid

    WHERE b.word = 2 AND c.word = 2) AS d

How can I do this faster? Is my query not efficient enough or it's the problem about the amount of data?

I can only come up with one solution that I delete the words which occur less than 2 in the paper_word_relation table. (About 4 million words only occur once)

Thanks!

mysql sql

asked Nov 20 '18 at 6:23

TomLeung

387

asked Nov 20 '18 at 6:23

TomLeung

387

asked Nov 20 '18 at 6:23

TomLeung

387

asked Nov 20 '18 at 6:23

TomLeung

387

asked Nov 20 '18 at 6:23

TomLeung

387

Which version of MySQL? What indexes do you have on your tables?

– oysteing
Nov 20 '18 at 23:07

add a comment |

Which version of MySQL? What indexes do you have on your tables?

– oysteing
Nov 20 '18 at 23:07

Which version of MySQL? What indexes do you have on your tables?

– oysteing
Nov 20 '18 at 23:07

add a comment |

2 Answers
2

active

oldest

votes

If you are only concerned with getting the Count, you should not be first getting the results into a Derived Table, and then Count the rows out. This may create unnecessary temporary tables storing lots of data in-memory. You can directly count the number of rows.

I also think that you need to count unique number of papers. Because of Many-to-Many relationships in paper_citation_relation table, duplicate rows may be coming for a single paper.

SELECT COUNT(DISTINCT a.pid_from) 

FROM paper_citation_relation AS a 

INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid

INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid

WHERE b.word = 2 AND c.word = 2

For performance, you will need following indexing:

Composite Index on (pid_from, pid_to) in the paper_citation_relation table.

Composite Index on (pid, word) in the paper_word_relation table.

~~We may also possibly optimize the query further by reducing one join and use conditional AND/OR based filtering in HAVING. You will need to benchmark it though.~~

SELECT COUNT(*) 

FROM (

      SELECT a.pid_from  

      FROM paper_citation_relation AS a 

      INNER JOIN paper_word_relation AS b 

        ON (a.pid_from = b.pid OR 

            a.pid_to = b.pid)  

      GROUP BY a.pid_from 

      HAVING SUM(a.pid_from = b.pid AND b.word = 2) AND 

             SUM(a.pid_to = b.pid AND b.word = 2)

     )

edited Nov 20 '18 at 7:04

answered Nov 20 '18 at 6:26

Madhur Bhaiya

19.6k62236

Thanks! But this modification does not improve much, still cost 65s. (66s originally). And the count is 9465 in this case. Is it the best time I can get with this amount of data?

– TomLeung
Nov 20 '18 at 6:32

1

@TomLeung what was the original count value ? Check my updated answer for why I think your count logic needs to be improved. Moreover, is there any indexing done on all the tables ?

– Madhur Bhaiya
Nov 20 '18 at 6:34

Yes, I do need the DISTINCT logic (4333 after distinct). All column is indexed using BTREE NORMAL index.

– TomLeung
Nov 20 '18 at 6:38

@TomLeung can you add the composite indexing as specified by me, and test.

– Madhur Bhaiya
Nov 20 '18 at 6:40

1

It seems this solution is even worse than the original one with composite index, anyway you really help me a lot! Thanks.

– TomLeung
Nov 20 '18 at 7:00

|
show 4 more comments

After the first 1:n join you get the same pid_to multiple times and your next join is no longer 1:n but n:m, creating a possibly huge intermediate result before the final DISTINCT. It's similar to a CROSS JOIN and it's getting worse for popular words, e.g. 10*10 vs. 1000*1000 rows.

You must remove the duplicates before the join, this should return the same number as @MadhurBhaiya's answer

SELECT Count(*) -- no more DISTINCT needed

FROM 

 (

    SELECT DISTINCT cr.pid_to -- reducing m to 1

    FROM paper_citation_relation AS cr

    JOIN paper_word_relation AS wr 

      ON cr.pid_from = wr.pid

    WHERE wr.word = 2

 ) AS dt

JOIN paper_word_relation AS wr

  ON dt.pid_to = wr.pid  -- 1:n join again

WHERE wr.word = 2

If you want to count the number of papers which have been cited you need to get a distinct list of pid (either pid_from or pid_to) from paper_citation_relation first and then join to the specific word.

SELECT Count(*)

FROM

( -- get a unique list of cited or citing papers

    SELECT pid_from AS pid -- citing

    FROM paper_citation_relation

    UNION -- DISTINCT by default

    SELECT pid_to          -- cited

    FROM paper_citation_relation 

) AS dt

JOIN paper_word_relation AS wr

  ON wr.pid = dt.pid

WHERE wr.word = 2 -- now check for the searched word

The number returned by this might be slightly higher (it counts a paper regardless if cited or citing).

edited Nov 20 '18 at 10:56

answered Nov 20 '18 at 10:01

dnoeth

45.5k31839

Thanks, but your query result isn't consistent with @MadhurBhaiya's result, I know very little about SQL query optimization so I don't know why...

– TomLeung
Nov 20 '18 at 10:18

Are the number totally different or quite similar? You should find out, what result you actually want and double check if any of the queries return that. E.g. check for differences when you do COUNT(DISTINCT a.pid_TO) in @MadhurBhaiya's query.

– dnoeth
Nov 20 '18 at 10:23

Quite similar. I need that DISTINCT, because if a paper with word W cite many paper with W, it should only count once

– TomLeung
Nov 20 '18 at 10:42

I added another variation, which might return slightly different number due to different logic.

– dnoeth
Nov 20 '18 at 10:57

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53387347%2fhow-can-i-count-rows-in-a-1nn-relation-in-a-faster-way%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

I also think that you need to count unique number of papers. Because of Many-to-Many relationships in paper_citation_relation table, duplicate rows may be coming for a single paper.

SELECT COUNT(DISTINCT a.pid_from) 

FROM paper_citation_relation AS a 

INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid

INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid

WHERE b.word = 2 AND c.word = 2

For performance, you will need following indexing:

Composite Index on (pid_from, pid_to) in the paper_citation_relation table.

Composite Index on (pid, word) in the paper_word_relation table.

~~We may also possibly optimize the query further by reducing one join and use conditional AND/OR based filtering in HAVING. You will need to benchmark it though.~~

SELECT COUNT(*) 

FROM (

      SELECT a.pid_from  

      FROM paper_citation_relation AS a 

      INNER JOIN paper_word_relation AS b 

        ON (a.pid_from = b.pid OR 

            a.pid_to = b.pid)  

      GROUP BY a.pid_from 

      HAVING SUM(a.pid_from = b.pid AND b.word = 2) AND 

             SUM(a.pid_to = b.pid AND b.word = 2)

     )

edited Nov 20 '18 at 7:04

answered Nov 20 '18 at 6:26

Madhur Bhaiya

19.6k62236

Thanks! But this modification does not improve much, still cost 65s. (66s originally). And the count is 9465 in this case. Is it the best time I can get with this amount of data?

– TomLeung
Nov 20 '18 at 6:32

1

@TomLeung what was the original count value ? Check my updated answer for why I think your count logic needs to be improved. Moreover, is there any indexing done on all the tables ?

– Madhur Bhaiya
Nov 20 '18 at 6:34

Yes, I do need the DISTINCT logic (4333 after distinct). All column is indexed using BTREE NORMAL index.

– TomLeung
Nov 20 '18 at 6:38

@TomLeung can you add the composite indexing as specified by me, and test.

– Madhur Bhaiya
Nov 20 '18 at 6:40

1

It seems this solution is even worse than the original one with composite index, anyway you really help me a lot! Thanks.

– TomLeung
Nov 20 '18 at 7:00

|
show 4 more comments

I also think that you need to count unique number of papers. Because of Many-to-Many relationships in paper_citation_relation table, duplicate rows may be coming for a single paper.

SELECT COUNT(DISTINCT a.pid_from) 

FROM paper_citation_relation AS a 

INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid

INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid

WHERE b.word = 2 AND c.word = 2

For performance, you will need following indexing:

Composite Index on (pid_from, pid_to) in the paper_citation_relation table.

Composite Index on (pid, word) in the paper_word_relation table.

~~We may also possibly optimize the query further by reducing one join and use conditional AND/OR based filtering in HAVING. You will need to benchmark it though.~~

SELECT COUNT(*) 

FROM (

      SELECT a.pid_from  

      FROM paper_citation_relation AS a 

      INNER JOIN paper_word_relation AS b 

        ON (a.pid_from = b.pid OR 

            a.pid_to = b.pid)  

      GROUP BY a.pid_from 

      HAVING SUM(a.pid_from = b.pid AND b.word = 2) AND 

             SUM(a.pid_to = b.pid AND b.word = 2)

     )

edited Nov 20 '18 at 7:04

answered Nov 20 '18 at 6:26

Madhur Bhaiya

19.6k62236

Thanks! But this modification does not improve much, still cost 65s. (66s originally). And the count is 9465 in this case. Is it the best time I can get with this amount of data?

– TomLeung
Nov 20 '18 at 6:32

1

@TomLeung what was the original count value ? Check my updated answer for why I think your count logic needs to be improved. Moreover, is there any indexing done on all the tables ?

– Madhur Bhaiya
Nov 20 '18 at 6:34

Yes, I do need the DISTINCT logic (4333 after distinct). All column is indexed using BTREE NORMAL index.

– TomLeung
Nov 20 '18 at 6:38

@TomLeung can you add the composite indexing as specified by me, and test.

– Madhur Bhaiya
Nov 20 '18 at 6:40

1

It seems this solution is even worse than the original one with composite index, anyway you really help me a lot! Thanks.

– TomLeung
Nov 20 '18 at 7:00

|
show 4 more comments

I also think that you need to count unique number of papers. Because of Many-to-Many relationships in paper_citation_relation table, duplicate rows may be coming for a single paper.

SELECT COUNT(DISTINCT a.pid_from) 

FROM paper_citation_relation AS a 

INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid

INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid

WHERE b.word = 2 AND c.word = 2

For performance, you will need following indexing:

Composite Index on (pid_from, pid_to) in the paper_citation_relation table.

Composite Index on (pid, word) in the paper_word_relation table.

~~We may also possibly optimize the query further by reducing one join and use conditional AND/OR based filtering in HAVING. You will need to benchmark it though.~~

SELECT COUNT(*) 

FROM (

      SELECT a.pid_from  

      FROM paper_citation_relation AS a 

      INNER JOIN paper_word_relation AS b 

        ON (a.pid_from = b.pid OR 

            a.pid_to = b.pid)  

      GROUP BY a.pid_from 

      HAVING SUM(a.pid_from = b.pid AND b.word = 2) AND 

             SUM(a.pid_to = b.pid AND b.word = 2)

     )

edited Nov 20 '18 at 7:04

answered Nov 20 '18 at 6:26

Madhur Bhaiya

19.6k62236

I also think that you need to count unique number of papers. Because of Many-to-Many relationships in paper_citation_relation table, duplicate rows may be coming for a single paper.

SELECT COUNT(DISTINCT a.pid_from) 

FROM paper_citation_relation AS a 

INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid

INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid

WHERE b.word = 2 AND c.word = 2

For performance, you will need following indexing:

Composite Index on (pid_from, pid_to) in the paper_citation_relation table.

Composite Index on (pid, word) in the paper_word_relation table.

~~We may also possibly optimize the query further by reducing one join and use conditional AND/OR based filtering in HAVING. You will need to benchmark it though.~~

SELECT COUNT(*) 

FROM (

      SELECT a.pid_from  

      FROM paper_citation_relation AS a 

      INNER JOIN paper_word_relation AS b 

        ON (a.pid_from = b.pid OR 

            a.pid_to = b.pid)  

      GROUP BY a.pid_from 

      HAVING SUM(a.pid_from = b.pid AND b.word = 2) AND 

             SUM(a.pid_to = b.pid AND b.word = 2)

     )

edited Nov 20 '18 at 7:04

answered Nov 20 '18 at 6:26

Madhur Bhaiya

19.6k62236

edited Nov 20 '18 at 7:04

answered Nov 20 '18 at 6:26

Madhur Bhaiya

19.6k62236

answered Nov 20 '18 at 6:26

Madhur Bhaiya

19.6k62236

answered Nov 20 '18 at 6:26

Madhur Bhaiya

19.6k62236

Thanks! But this modification does not improve much, still cost 65s. (66s originally). And the count is 9465 in this case. Is it the best time I can get with this amount of data?

– TomLeung
Nov 20 '18 at 6:32

1

@TomLeung what was the original count value ? Check my updated answer for why I think your count logic needs to be improved. Moreover, is there any indexing done on all the tables ?

– Madhur Bhaiya
Nov 20 '18 at 6:34

Yes, I do need the DISTINCT logic (4333 after distinct). All column is indexed using BTREE NORMAL index.

– TomLeung
Nov 20 '18 at 6:38

@TomLeung can you add the composite indexing as specified by me, and test.

– Madhur Bhaiya
Nov 20 '18 at 6:40

1

It seems this solution is even worse than the original one with composite index, anyway you really help me a lot! Thanks.

– TomLeung
Nov 20 '18 at 7:00

|
show 4 more comments

Thanks! But this modification does not improve much, still cost 65s. (66s originally). And the count is 9465 in this case. Is it the best time I can get with this amount of data?

– TomLeung
Nov 20 '18 at 6:32

1

@TomLeung what was the original count value ? Check my updated answer for why I think your count logic needs to be improved. Moreover, is there any indexing done on all the tables ?

– Madhur Bhaiya
Nov 20 '18 at 6:34

Yes, I do need the DISTINCT logic (4333 after distinct). All column is indexed using BTREE NORMAL index.

– TomLeung
Nov 20 '18 at 6:38

@TomLeung can you add the composite indexing as specified by me, and test.

– Madhur Bhaiya
Nov 20 '18 at 6:40

1

It seems this solution is even worse than the original one with composite index, anyway you really help me a lot! Thanks.

– TomLeung
Nov 20 '18 at 7:00

Thanks! But this modification does not improve much, still cost 65s. (66s originally). And the count is 9465 in this case. Is it the best time I can get with this amount of data?

– TomLeung
Nov 20 '18 at 6:32

@TomLeung what was the original count value ? Check my updated answer for why I think your count logic needs to be improved. Moreover, is there any indexing done on all the tables ?

– Madhur Bhaiya
Nov 20 '18 at 6:34

Yes, I do need the DISTINCT logic (4333 after distinct). All column is indexed using BTREE NORMAL index.

– TomLeung
Nov 20 '18 at 6:38

@TomLeung can you add the composite indexing as specified by me, and test.

– Madhur Bhaiya
Nov 20 '18 at 6:40

It seems this solution is even worse than the original one with composite index, anyway you really help me a lot! Thanks.

– TomLeung
Nov 20 '18 at 7:00

|
show 4 more comments

You must remove the duplicates before the join, this should return the same number as @MadhurBhaiya's answer

SELECT Count(*) -- no more DISTINCT needed

FROM 

 (

    SELECT DISTINCT cr.pid_to -- reducing m to 1

    FROM paper_citation_relation AS cr

    JOIN paper_word_relation AS wr 

      ON cr.pid_from = wr.pid

    WHERE wr.word = 2

 ) AS dt

JOIN paper_word_relation AS wr

  ON dt.pid_to = wr.pid  -- 1:n join again

WHERE wr.word = 2

SELECT Count(*)

FROM

( -- get a unique list of cited or citing papers

    SELECT pid_from AS pid -- citing

    FROM paper_citation_relation

    UNION -- DISTINCT by default

    SELECT pid_to          -- cited

    FROM paper_citation_relation 

) AS dt

JOIN paper_word_relation AS wr

  ON wr.pid = dt.pid

WHERE wr.word = 2 -- now check for the searched word

The number returned by this might be slightly higher (it counts a paper regardless if cited or citing).

edited Nov 20 '18 at 10:56

answered Nov 20 '18 at 10:01

dnoeth

45.5k31839

Thanks, but your query result isn't consistent with @MadhurBhaiya's result, I know very little about SQL query optimization so I don't know why...

– TomLeung
Nov 20 '18 at 10:18

Are the number totally different or quite similar? You should find out, what result you actually want and double check if any of the queries return that. E.g. check for differences when you do COUNT(DISTINCT a.pid_TO) in @MadhurBhaiya's query.

– dnoeth
Nov 20 '18 at 10:23

Quite similar. I need that DISTINCT, because if a paper with word W cite many paper with W, it should only count once

– TomLeung
Nov 20 '18 at 10:42

I added another variation, which might return slightly different number due to different logic.

– dnoeth
Nov 20 '18 at 10:57

add a comment |

You must remove the duplicates before the join, this should return the same number as @MadhurBhaiya's answer

SELECT Count(*) -- no more DISTINCT needed

FROM 

 (

    SELECT DISTINCT cr.pid_to -- reducing m to 1

    FROM paper_citation_relation AS cr

    JOIN paper_word_relation AS wr 

      ON cr.pid_from = wr.pid

    WHERE wr.word = 2

 ) AS dt

JOIN paper_word_relation AS wr

  ON dt.pid_to = wr.pid  -- 1:n join again

WHERE wr.word = 2

SELECT Count(*)

FROM

( -- get a unique list of cited or citing papers

    SELECT pid_from AS pid -- citing

    FROM paper_citation_relation

    UNION -- DISTINCT by default

    SELECT pid_to          -- cited

    FROM paper_citation_relation 

) AS dt

JOIN paper_word_relation AS wr

  ON wr.pid = dt.pid

WHERE wr.word = 2 -- now check for the searched word

The number returned by this might be slightly higher (it counts a paper regardless if cited or citing).

edited Nov 20 '18 at 10:56

answered Nov 20 '18 at 10:01

dnoeth

45.5k31839

Thanks, but your query result isn't consistent with @MadhurBhaiya's result, I know very little about SQL query optimization so I don't know why...

– TomLeung
Nov 20 '18 at 10:18

Are the number totally different or quite similar? You should find out, what result you actually want and double check if any of the queries return that. E.g. check for differences when you do COUNT(DISTINCT a.pid_TO) in @MadhurBhaiya's query.

– dnoeth
Nov 20 '18 at 10:23

Quite similar. I need that DISTINCT, because if a paper with word W cite many paper with W, it should only count once

– TomLeung
Nov 20 '18 at 10:42

I added another variation, which might return slightly different number due to different logic.

– dnoeth
Nov 20 '18 at 10:57

add a comment |

You must remove the duplicates before the join, this should return the same number as @MadhurBhaiya's answer

SELECT Count(*) -- no more DISTINCT needed

FROM 

 (

    SELECT DISTINCT cr.pid_to -- reducing m to 1

    FROM paper_citation_relation AS cr

    JOIN paper_word_relation AS wr 

      ON cr.pid_from = wr.pid

    WHERE wr.word = 2

 ) AS dt

JOIN paper_word_relation AS wr

  ON dt.pid_to = wr.pid  -- 1:n join again

WHERE wr.word = 2

SELECT Count(*)

FROM

( -- get a unique list of cited or citing papers

    SELECT pid_from AS pid -- citing

    FROM paper_citation_relation

    UNION -- DISTINCT by default

    SELECT pid_to          -- cited

    FROM paper_citation_relation 

) AS dt

JOIN paper_word_relation AS wr

  ON wr.pid = dt.pid

WHERE wr.word = 2 -- now check for the searched word

The number returned by this might be slightly higher (it counts a paper regardless if cited or citing).

edited Nov 20 '18 at 10:56

answered Nov 20 '18 at 10:01

dnoeth

45.5k31839

You must remove the duplicates before the join, this should return the same number as @MadhurBhaiya's answer

SELECT Count(*) -- no more DISTINCT needed

FROM 

 (

    SELECT DISTINCT cr.pid_to -- reducing m to 1

    FROM paper_citation_relation AS cr

    JOIN paper_word_relation AS wr 

      ON cr.pid_from = wr.pid

    WHERE wr.word = 2

 ) AS dt

JOIN paper_word_relation AS wr

  ON dt.pid_to = wr.pid  -- 1:n join again

WHERE wr.word = 2

SELECT Count(*)

FROM

( -- get a unique list of cited or citing papers

    SELECT pid_from AS pid -- citing

    FROM paper_citation_relation

    UNION -- DISTINCT by default

    SELECT pid_to          -- cited

    FROM paper_citation_relation 

) AS dt

JOIN paper_word_relation AS wr

  ON wr.pid = dt.pid

WHERE wr.word = 2 -- now check for the searched word

The number returned by this might be slightly higher (it counts a paper regardless if cited or citing).

edited Nov 20 '18 at 10:56

answered Nov 20 '18 at 10:01

dnoeth

45.5k31839

edited Nov 20 '18 at 10:56

answered Nov 20 '18 at 10:01

dnoeth

45.5k31839

answered Nov 20 '18 at 10:01

dnoeth

45.5k31839

answered Nov 20 '18 at 10:01

dnoeth

45.5k31839

Thanks, but your query result isn't consistent with @MadhurBhaiya's result, I know very little about SQL query optimization so I don't know why...

– TomLeung
Nov 20 '18 at 10:18

Are the number totally different or quite similar? You should find out, what result you actually want and double check if any of the queries return that. E.g. check for differences when you do COUNT(DISTINCT a.pid_TO) in @MadhurBhaiya's query.

– dnoeth
Nov 20 '18 at 10:23

Quite similar. I need that DISTINCT, because if a paper with word W cite many paper with W, it should only count once

– TomLeung
Nov 20 '18 at 10:42

I added another variation, which might return slightly different number due to different logic.

– dnoeth
Nov 20 '18 at 10:57

add a comment |

Thanks, but your query result isn't consistent with @MadhurBhaiya's result, I know very little about SQL query optimization so I don't know why...

– TomLeung
Nov 20 '18 at 10:18

Are the number totally different or quite similar? You should find out, what result you actually want and double check if any of the queries return that. E.g. check for differences when you do COUNT(DISTINCT a.pid_TO) in @MadhurBhaiya's query.

– dnoeth
Nov 20 '18 at 10:23

Quite similar. I need that DISTINCT, because if a paper with word W cite many paper with W, it should only count once

– TomLeung
Nov 20 '18 at 10:42

I added another variation, which might return slightly different number due to different logic.

– dnoeth
Nov 20 '18 at 10:57

Thanks, but your query result isn't consistent with @MadhurBhaiya's result, I know very little about SQL query optimization so I don't know why...

– TomLeung
Nov 20 '18 at 10:18

Are the number totally different or quite similar? You should find out, what result you actually want and double check if any of the queries return that. E.g. check for differences when you do COUNT(DISTINCT a.pid_TO) in @MadhurBhaiya's query.

– dnoeth
Nov 20 '18 at 10:23

Quite similar. I need that DISTINCT, because if a paper with word W cite many paper with W, it should only count once

– TomLeung
Nov 20 '18 at 10:42

I added another variation, which might return slightly different number due to different logic.

– dnoeth
Nov 20 '18 at 10:57

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Wsrtjtyk