How can I count rows in a 1:N:N relation in a faster way?












1















This question is a bit complicated to me, and I can't explain it in one sentence so the title may seem quite ambiguous.



I have 3 tables in my MySQL database, their structure is shown below:




  1. word_list (5 million rows)




+-----+--------+
| wid | word |
+-----+--------+
| 1 | foo |
| 2 | bar |
| 3 | hello |
+-----+--------+




  1. paper_word_relation (10 million rows)




+-----+-------+
| pid | word |
+-----+-------+
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 1 |
| 2 | 3 |
+-----+-------+




  1. paper_citation_relation (80K rows)




+----------+--------+
| pid_from | pid_to |
+----------+--------+
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 2 | 1 |
| 2 | 3 |
+----------+--------+



I want to find out how many papers contain word W, and cite the papers also contain word W.(for each word in the list)



I use two inner join to do this job but it seems extremely slow when the word is popular - above 50s (quite fast if the word is rarely used - below 0.1s), here is my code





SELECT COUNT(*) FROM (
SELECT a.pid_from, a.pid_to, b.word FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid
INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid
WHERE b.word = 2 AND c.word = 2) AS d



How can I do this faster? Is my query not efficient enough or it's the problem about the amount of data?



I can only come up with one solution that I delete the words which occur less than 2 in the paper_word_relation table. (About 4 million words only occur once)



Thanks!










share|improve this question























  • Which version of MySQL? What indexes do you have on your tables?

    – oysteing
    Nov 20 '18 at 23:07
















1















This question is a bit complicated to me, and I can't explain it in one sentence so the title may seem quite ambiguous.



I have 3 tables in my MySQL database, their structure is shown below:




  1. word_list (5 million rows)




+-----+--------+
| wid | word |
+-----+--------+
| 1 | foo |
| 2 | bar |
| 3 | hello |
+-----+--------+




  1. paper_word_relation (10 million rows)




+-----+-------+
| pid | word |
+-----+-------+
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 1 |
| 2 | 3 |
+-----+-------+




  1. paper_citation_relation (80K rows)




+----------+--------+
| pid_from | pid_to |
+----------+--------+
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 2 | 1 |
| 2 | 3 |
+----------+--------+



I want to find out how many papers contain word W, and cite the papers also contain word W.(for each word in the list)



I use two inner join to do this job but it seems extremely slow when the word is popular - above 50s (quite fast if the word is rarely used - below 0.1s), here is my code





SELECT COUNT(*) FROM (
SELECT a.pid_from, a.pid_to, b.word FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid
INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid
WHERE b.word = 2 AND c.word = 2) AS d



How can I do this faster? Is my query not efficient enough or it's the problem about the amount of data?



I can only come up with one solution that I delete the words which occur less than 2 in the paper_word_relation table. (About 4 million words only occur once)



Thanks!










share|improve this question























  • Which version of MySQL? What indexes do you have on your tables?

    – oysteing
    Nov 20 '18 at 23:07














1












1








1








This question is a bit complicated to me, and I can't explain it in one sentence so the title may seem quite ambiguous.



I have 3 tables in my MySQL database, their structure is shown below:




  1. word_list (5 million rows)




+-----+--------+
| wid | word |
+-----+--------+
| 1 | foo |
| 2 | bar |
| 3 | hello |
+-----+--------+




  1. paper_word_relation (10 million rows)




+-----+-------+
| pid | word |
+-----+-------+
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 1 |
| 2 | 3 |
+-----+-------+




  1. paper_citation_relation (80K rows)




+----------+--------+
| pid_from | pid_to |
+----------+--------+
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 2 | 1 |
| 2 | 3 |
+----------+--------+



I want to find out how many papers contain word W, and cite the papers also contain word W.(for each word in the list)



I use two inner join to do this job but it seems extremely slow when the word is popular - above 50s (quite fast if the word is rarely used - below 0.1s), here is my code





SELECT COUNT(*) FROM (
SELECT a.pid_from, a.pid_to, b.word FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid
INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid
WHERE b.word = 2 AND c.word = 2) AS d



How can I do this faster? Is my query not efficient enough or it's the problem about the amount of data?



I can only come up with one solution that I delete the words which occur less than 2 in the paper_word_relation table. (About 4 million words only occur once)



Thanks!










share|improve this question














This question is a bit complicated to me, and I can't explain it in one sentence so the title may seem quite ambiguous.



I have 3 tables in my MySQL database, their structure is shown below:




  1. word_list (5 million rows)




+-----+--------+
| wid | word |
+-----+--------+
| 1 | foo |
| 2 | bar |
| 3 | hello |
+-----+--------+




  1. paper_word_relation (10 million rows)




+-----+-------+
| pid | word |
+-----+-------+
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 1 |
| 2 | 3 |
+-----+-------+




  1. paper_citation_relation (80K rows)




+----------+--------+
| pid_from | pid_to |
+----------+--------+
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 2 | 1 |
| 2 | 3 |
+----------+--------+



I want to find out how many papers contain word W, and cite the papers also contain word W.(for each word in the list)



I use two inner join to do this job but it seems extremely slow when the word is popular - above 50s (quite fast if the word is rarely used - below 0.1s), here is my code





SELECT COUNT(*) FROM (
SELECT a.pid_from, a.pid_to, b.word FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid
INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid
WHERE b.word = 2 AND c.word = 2) AS d



How can I do this faster? Is my query not efficient enough or it's the problem about the amount of data?



I can only come up with one solution that I delete the words which occur less than 2 in the paper_word_relation table. (About 4 million words only occur once)



Thanks!







mysql sql






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 20 '18 at 6:23









TomLeungTomLeung

387




387













  • Which version of MySQL? What indexes do you have on your tables?

    – oysteing
    Nov 20 '18 at 23:07



















  • Which version of MySQL? What indexes do you have on your tables?

    – oysteing
    Nov 20 '18 at 23:07

















Which version of MySQL? What indexes do you have on your tables?

– oysteing
Nov 20 '18 at 23:07





Which version of MySQL? What indexes do you have on your tables?

– oysteing
Nov 20 '18 at 23:07












2 Answers
2






active

oldest

votes


















1














If you are only concerned with getting the Count, you should not be first getting the results into a Derived Table, and then Count the rows out. This may create unnecessary temporary tables storing lots of data in-memory. You can directly count the number of rows.



I also think that you need to count unique number of papers. Because of Many-to-Many relationships in paper_citation_relation table, duplicate rows may be coming for a single paper.



SELECT COUNT(DISTINCT a.pid_from) 
FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid
INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid
WHERE b.word = 2 AND c.word = 2


For performance, you will need following indexing:




  • Composite Index on (pid_from, pid_to) in the paper_citation_relation table.

  • Composite Index on (pid, word) in the paper_word_relation table.




We may also possibly optimize the query further by reducing one join and use conditional AND/OR based filtering in HAVING. You will need to benchmark it though.



SELECT COUNT(*) 
FROM (
SELECT a.pid_from
FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b
ON (a.pid_from = b.pid OR
a.pid_to = b.pid)
GROUP BY a.pid_from
HAVING SUM(a.pid_from = b.pid AND b.word = 2) AND
SUM(a.pid_to = b.pid AND b.word = 2)
)







share|improve this answer


























  • Thanks! But this modification does not improve much, still cost 65s. (66s originally). And the count is 9465 in this case. Is it the best time I can get with this amount of data?

    – TomLeung
    Nov 20 '18 at 6:32






  • 1





    @TomLeung what was the original count value ? Check my updated answer for why I think your count logic needs to be improved. Moreover, is there any indexing done on all the tables ?

    – Madhur Bhaiya
    Nov 20 '18 at 6:34











  • Yes, I do need the DISTINCT logic (4333 after distinct). All column is indexed using BTREE NORMAL index.

    – TomLeung
    Nov 20 '18 at 6:38











  • @TomLeung can you add the composite indexing as specified by me, and test.

    – Madhur Bhaiya
    Nov 20 '18 at 6:40






  • 1





    It seems this solution is even worse than the original one with composite index, anyway you really help me a lot! Thanks.

    – TomLeung
    Nov 20 '18 at 7:00



















0














After the first 1:n join you get the same pid_to multiple times and your next join is no longer 1:n but n:m, creating a possibly huge intermediate result before the final DISTINCT. It's similar to a CROSS JOIN and it's getting worse for popular words, e.g. 10*10 vs. 1000*1000 rows.



You must remove the duplicates before the join, this should return the same number as @MadhurBhaiya's answer



SELECT Count(*) -- no more DISTINCT needed
FROM
(
SELECT DISTINCT cr.pid_to -- reducing m to 1
FROM paper_citation_relation AS cr
JOIN paper_word_relation AS wr
ON cr.pid_from = wr.pid
WHERE wr.word = 2
) AS dt
JOIN paper_word_relation AS wr
ON dt.pid_to = wr.pid -- 1:n join again
WHERE wr.word = 2


If you want to count the number of papers which have been cited you need to get a distinct list of pid (either pid_from or pid_to) from paper_citation_relation first and then join to the specific word.



SELECT Count(*)
FROM
( -- get a unique list of cited or citing papers
SELECT pid_from AS pid -- citing
FROM paper_citation_relation
UNION -- DISTINCT by default
SELECT pid_to -- cited
FROM paper_citation_relation
) AS dt
JOIN paper_word_relation AS wr
ON wr.pid = dt.pid
WHERE wr.word = 2 -- now check for the searched word


The number returned by this might be slightly higher (it counts a paper regardless if cited or citing).






share|improve this answer


























  • Thanks, but your query result isn't consistent with @MadhurBhaiya's result, I know very little about SQL query optimization so I don't know why...

    – TomLeung
    Nov 20 '18 at 10:18











  • Are the number totally different or quite similar? You should find out, what result you actually want and double check if any of the queries return that. E.g. check for differences when you do COUNT(DISTINCT a.pid_TO) in @MadhurBhaiya's query.

    – dnoeth
    Nov 20 '18 at 10:23













  • Quite similar. I need that DISTINCT, because if a paper with word W cite many paper with W, it should only count once

    – TomLeung
    Nov 20 '18 at 10:42











  • I added another variation, which might return slightly different number due to different logic.

    – dnoeth
    Nov 20 '18 at 10:57











Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53387347%2fhow-can-i-count-rows-in-a-1nn-relation-in-a-faster-way%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes









1














If you are only concerned with getting the Count, you should not be first getting the results into a Derived Table, and then Count the rows out. This may create unnecessary temporary tables storing lots of data in-memory. You can directly count the number of rows.



I also think that you need to count unique number of papers. Because of Many-to-Many relationships in paper_citation_relation table, duplicate rows may be coming for a single paper.



SELECT COUNT(DISTINCT a.pid_from) 
FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid
INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid
WHERE b.word = 2 AND c.word = 2


For performance, you will need following indexing:




  • Composite Index on (pid_from, pid_to) in the paper_citation_relation table.

  • Composite Index on (pid, word) in the paper_word_relation table.




We may also possibly optimize the query further by reducing one join and use conditional AND/OR based filtering in HAVING. You will need to benchmark it though.



SELECT COUNT(*) 
FROM (
SELECT a.pid_from
FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b
ON (a.pid_from = b.pid OR
a.pid_to = b.pid)
GROUP BY a.pid_from
HAVING SUM(a.pid_from = b.pid AND b.word = 2) AND
SUM(a.pid_to = b.pid AND b.word = 2)
)







share|improve this answer


























  • Thanks! But this modification does not improve much, still cost 65s. (66s originally). And the count is 9465 in this case. Is it the best time I can get with this amount of data?

    – TomLeung
    Nov 20 '18 at 6:32






  • 1





    @TomLeung what was the original count value ? Check my updated answer for why I think your count logic needs to be improved. Moreover, is there any indexing done on all the tables ?

    – Madhur Bhaiya
    Nov 20 '18 at 6:34











  • Yes, I do need the DISTINCT logic (4333 after distinct). All column is indexed using BTREE NORMAL index.

    – TomLeung
    Nov 20 '18 at 6:38











  • @TomLeung can you add the composite indexing as specified by me, and test.

    – Madhur Bhaiya
    Nov 20 '18 at 6:40






  • 1





    It seems this solution is even worse than the original one with composite index, anyway you really help me a lot! Thanks.

    – TomLeung
    Nov 20 '18 at 7:00
















1














If you are only concerned with getting the Count, you should not be first getting the results into a Derived Table, and then Count the rows out. This may create unnecessary temporary tables storing lots of data in-memory. You can directly count the number of rows.



I also think that you need to count unique number of papers. Because of Many-to-Many relationships in paper_citation_relation table, duplicate rows may be coming for a single paper.



SELECT COUNT(DISTINCT a.pid_from) 
FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid
INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid
WHERE b.word = 2 AND c.word = 2


For performance, you will need following indexing:




  • Composite Index on (pid_from, pid_to) in the paper_citation_relation table.

  • Composite Index on (pid, word) in the paper_word_relation table.




We may also possibly optimize the query further by reducing one join and use conditional AND/OR based filtering in HAVING. You will need to benchmark it though.



SELECT COUNT(*) 
FROM (
SELECT a.pid_from
FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b
ON (a.pid_from = b.pid OR
a.pid_to = b.pid)
GROUP BY a.pid_from
HAVING SUM(a.pid_from = b.pid AND b.word = 2) AND
SUM(a.pid_to = b.pid AND b.word = 2)
)







share|improve this answer


























  • Thanks! But this modification does not improve much, still cost 65s. (66s originally). And the count is 9465 in this case. Is it the best time I can get with this amount of data?

    – TomLeung
    Nov 20 '18 at 6:32






  • 1





    @TomLeung what was the original count value ? Check my updated answer for why I think your count logic needs to be improved. Moreover, is there any indexing done on all the tables ?

    – Madhur Bhaiya
    Nov 20 '18 at 6:34











  • Yes, I do need the DISTINCT logic (4333 after distinct). All column is indexed using BTREE NORMAL index.

    – TomLeung
    Nov 20 '18 at 6:38











  • @TomLeung can you add the composite indexing as specified by me, and test.

    – Madhur Bhaiya
    Nov 20 '18 at 6:40






  • 1





    It seems this solution is even worse than the original one with composite index, anyway you really help me a lot! Thanks.

    – TomLeung
    Nov 20 '18 at 7:00














1












1








1







If you are only concerned with getting the Count, you should not be first getting the results into a Derived Table, and then Count the rows out. This may create unnecessary temporary tables storing lots of data in-memory. You can directly count the number of rows.



I also think that you need to count unique number of papers. Because of Many-to-Many relationships in paper_citation_relation table, duplicate rows may be coming for a single paper.



SELECT COUNT(DISTINCT a.pid_from) 
FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid
INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid
WHERE b.word = 2 AND c.word = 2


For performance, you will need following indexing:




  • Composite Index on (pid_from, pid_to) in the paper_citation_relation table.

  • Composite Index on (pid, word) in the paper_word_relation table.




We may also possibly optimize the query further by reducing one join and use conditional AND/OR based filtering in HAVING. You will need to benchmark it though.



SELECT COUNT(*) 
FROM (
SELECT a.pid_from
FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b
ON (a.pid_from = b.pid OR
a.pid_to = b.pid)
GROUP BY a.pid_from
HAVING SUM(a.pid_from = b.pid AND b.word = 2) AND
SUM(a.pid_to = b.pid AND b.word = 2)
)







share|improve this answer















If you are only concerned with getting the Count, you should not be first getting the results into a Derived Table, and then Count the rows out. This may create unnecessary temporary tables storing lots of data in-memory. You can directly count the number of rows.



I also think that you need to count unique number of papers. Because of Many-to-Many relationships in paper_citation_relation table, duplicate rows may be coming for a single paper.



SELECT COUNT(DISTINCT a.pid_from) 
FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid
INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid
WHERE b.word = 2 AND c.word = 2


For performance, you will need following indexing:




  • Composite Index on (pid_from, pid_to) in the paper_citation_relation table.

  • Composite Index on (pid, word) in the paper_word_relation table.




We may also possibly optimize the query further by reducing one join and use conditional AND/OR based filtering in HAVING. You will need to benchmark it though.



SELECT COUNT(*) 
FROM (
SELECT a.pid_from
FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b
ON (a.pid_from = b.pid OR
a.pid_to = b.pid)
GROUP BY a.pid_from
HAVING SUM(a.pid_from = b.pid AND b.word = 2) AND
SUM(a.pid_to = b.pid AND b.word = 2)
)








share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 20 '18 at 7:04

























answered Nov 20 '18 at 6:26









Madhur BhaiyaMadhur Bhaiya

19.6k62236




19.6k62236













  • Thanks! But this modification does not improve much, still cost 65s. (66s originally). And the count is 9465 in this case. Is it the best time I can get with this amount of data?

    – TomLeung
    Nov 20 '18 at 6:32






  • 1





    @TomLeung what was the original count value ? Check my updated answer for why I think your count logic needs to be improved. Moreover, is there any indexing done on all the tables ?

    – Madhur Bhaiya
    Nov 20 '18 at 6:34











  • Yes, I do need the DISTINCT logic (4333 after distinct). All column is indexed using BTREE NORMAL index.

    – TomLeung
    Nov 20 '18 at 6:38











  • @TomLeung can you add the composite indexing as specified by me, and test.

    – Madhur Bhaiya
    Nov 20 '18 at 6:40






  • 1





    It seems this solution is even worse than the original one with composite index, anyway you really help me a lot! Thanks.

    – TomLeung
    Nov 20 '18 at 7:00



















  • Thanks! But this modification does not improve much, still cost 65s. (66s originally). And the count is 9465 in this case. Is it the best time I can get with this amount of data?

    – TomLeung
    Nov 20 '18 at 6:32






  • 1





    @TomLeung what was the original count value ? Check my updated answer for why I think your count logic needs to be improved. Moreover, is there any indexing done on all the tables ?

    – Madhur Bhaiya
    Nov 20 '18 at 6:34











  • Yes, I do need the DISTINCT logic (4333 after distinct). All column is indexed using BTREE NORMAL index.

    – TomLeung
    Nov 20 '18 at 6:38











  • @TomLeung can you add the composite indexing as specified by me, and test.

    – Madhur Bhaiya
    Nov 20 '18 at 6:40






  • 1





    It seems this solution is even worse than the original one with composite index, anyway you really help me a lot! Thanks.

    – TomLeung
    Nov 20 '18 at 7:00

















Thanks! But this modification does not improve much, still cost 65s. (66s originally). And the count is 9465 in this case. Is it the best time I can get with this amount of data?

– TomLeung
Nov 20 '18 at 6:32





Thanks! But this modification does not improve much, still cost 65s. (66s originally). And the count is 9465 in this case. Is it the best time I can get with this amount of data?

– TomLeung
Nov 20 '18 at 6:32




1




1





@TomLeung what was the original count value ? Check my updated answer for why I think your count logic needs to be improved. Moreover, is there any indexing done on all the tables ?

– Madhur Bhaiya
Nov 20 '18 at 6:34





@TomLeung what was the original count value ? Check my updated answer for why I think your count logic needs to be improved. Moreover, is there any indexing done on all the tables ?

– Madhur Bhaiya
Nov 20 '18 at 6:34













Yes, I do need the DISTINCT logic (4333 after distinct). All column is indexed using BTREE NORMAL index.

– TomLeung
Nov 20 '18 at 6:38





Yes, I do need the DISTINCT logic (4333 after distinct). All column is indexed using BTREE NORMAL index.

– TomLeung
Nov 20 '18 at 6:38













@TomLeung can you add the composite indexing as specified by me, and test.

– Madhur Bhaiya
Nov 20 '18 at 6:40





@TomLeung can you add the composite indexing as specified by me, and test.

– Madhur Bhaiya
Nov 20 '18 at 6:40




1




1





It seems this solution is even worse than the original one with composite index, anyway you really help me a lot! Thanks.

– TomLeung
Nov 20 '18 at 7:00





It seems this solution is even worse than the original one with composite index, anyway you really help me a lot! Thanks.

– TomLeung
Nov 20 '18 at 7:00













0














After the first 1:n join you get the same pid_to multiple times and your next join is no longer 1:n but n:m, creating a possibly huge intermediate result before the final DISTINCT. It's similar to a CROSS JOIN and it's getting worse for popular words, e.g. 10*10 vs. 1000*1000 rows.



You must remove the duplicates before the join, this should return the same number as @MadhurBhaiya's answer



SELECT Count(*) -- no more DISTINCT needed
FROM
(
SELECT DISTINCT cr.pid_to -- reducing m to 1
FROM paper_citation_relation AS cr
JOIN paper_word_relation AS wr
ON cr.pid_from = wr.pid
WHERE wr.word = 2
) AS dt
JOIN paper_word_relation AS wr
ON dt.pid_to = wr.pid -- 1:n join again
WHERE wr.word = 2


If you want to count the number of papers which have been cited you need to get a distinct list of pid (either pid_from or pid_to) from paper_citation_relation first and then join to the specific word.



SELECT Count(*)
FROM
( -- get a unique list of cited or citing papers
SELECT pid_from AS pid -- citing
FROM paper_citation_relation
UNION -- DISTINCT by default
SELECT pid_to -- cited
FROM paper_citation_relation
) AS dt
JOIN paper_word_relation AS wr
ON wr.pid = dt.pid
WHERE wr.word = 2 -- now check for the searched word


The number returned by this might be slightly higher (it counts a paper regardless if cited or citing).






share|improve this answer


























  • Thanks, but your query result isn't consistent with @MadhurBhaiya's result, I know very little about SQL query optimization so I don't know why...

    – TomLeung
    Nov 20 '18 at 10:18











  • Are the number totally different or quite similar? You should find out, what result you actually want and double check if any of the queries return that. E.g. check for differences when you do COUNT(DISTINCT a.pid_TO) in @MadhurBhaiya's query.

    – dnoeth
    Nov 20 '18 at 10:23













  • Quite similar. I need that DISTINCT, because if a paper with word W cite many paper with W, it should only count once

    – TomLeung
    Nov 20 '18 at 10:42











  • I added another variation, which might return slightly different number due to different logic.

    – dnoeth
    Nov 20 '18 at 10:57
















0














After the first 1:n join you get the same pid_to multiple times and your next join is no longer 1:n but n:m, creating a possibly huge intermediate result before the final DISTINCT. It's similar to a CROSS JOIN and it's getting worse for popular words, e.g. 10*10 vs. 1000*1000 rows.



You must remove the duplicates before the join, this should return the same number as @MadhurBhaiya's answer



SELECT Count(*) -- no more DISTINCT needed
FROM
(
SELECT DISTINCT cr.pid_to -- reducing m to 1
FROM paper_citation_relation AS cr
JOIN paper_word_relation AS wr
ON cr.pid_from = wr.pid
WHERE wr.word = 2
) AS dt
JOIN paper_word_relation AS wr
ON dt.pid_to = wr.pid -- 1:n join again
WHERE wr.word = 2


If you want to count the number of papers which have been cited you need to get a distinct list of pid (either pid_from or pid_to) from paper_citation_relation first and then join to the specific word.



SELECT Count(*)
FROM
( -- get a unique list of cited or citing papers
SELECT pid_from AS pid -- citing
FROM paper_citation_relation
UNION -- DISTINCT by default
SELECT pid_to -- cited
FROM paper_citation_relation
) AS dt
JOIN paper_word_relation AS wr
ON wr.pid = dt.pid
WHERE wr.word = 2 -- now check for the searched word


The number returned by this might be slightly higher (it counts a paper regardless if cited or citing).






share|improve this answer


























  • Thanks, but your query result isn't consistent with @MadhurBhaiya's result, I know very little about SQL query optimization so I don't know why...

    – TomLeung
    Nov 20 '18 at 10:18











  • Are the number totally different or quite similar? You should find out, what result you actually want and double check if any of the queries return that. E.g. check for differences when you do COUNT(DISTINCT a.pid_TO) in @MadhurBhaiya's query.

    – dnoeth
    Nov 20 '18 at 10:23













  • Quite similar. I need that DISTINCT, because if a paper with word W cite many paper with W, it should only count once

    – TomLeung
    Nov 20 '18 at 10:42











  • I added another variation, which might return slightly different number due to different logic.

    – dnoeth
    Nov 20 '18 at 10:57














0












0








0







After the first 1:n join you get the same pid_to multiple times and your next join is no longer 1:n but n:m, creating a possibly huge intermediate result before the final DISTINCT. It's similar to a CROSS JOIN and it's getting worse for popular words, e.g. 10*10 vs. 1000*1000 rows.



You must remove the duplicates before the join, this should return the same number as @MadhurBhaiya's answer



SELECT Count(*) -- no more DISTINCT needed
FROM
(
SELECT DISTINCT cr.pid_to -- reducing m to 1
FROM paper_citation_relation AS cr
JOIN paper_word_relation AS wr
ON cr.pid_from = wr.pid
WHERE wr.word = 2
) AS dt
JOIN paper_word_relation AS wr
ON dt.pid_to = wr.pid -- 1:n join again
WHERE wr.word = 2


If you want to count the number of papers which have been cited you need to get a distinct list of pid (either pid_from or pid_to) from paper_citation_relation first and then join to the specific word.



SELECT Count(*)
FROM
( -- get a unique list of cited or citing papers
SELECT pid_from AS pid -- citing
FROM paper_citation_relation
UNION -- DISTINCT by default
SELECT pid_to -- cited
FROM paper_citation_relation
) AS dt
JOIN paper_word_relation AS wr
ON wr.pid = dt.pid
WHERE wr.word = 2 -- now check for the searched word


The number returned by this might be slightly higher (it counts a paper regardless if cited or citing).






share|improve this answer















After the first 1:n join you get the same pid_to multiple times and your next join is no longer 1:n but n:m, creating a possibly huge intermediate result before the final DISTINCT. It's similar to a CROSS JOIN and it's getting worse for popular words, e.g. 10*10 vs. 1000*1000 rows.



You must remove the duplicates before the join, this should return the same number as @MadhurBhaiya's answer



SELECT Count(*) -- no more DISTINCT needed
FROM
(
SELECT DISTINCT cr.pid_to -- reducing m to 1
FROM paper_citation_relation AS cr
JOIN paper_word_relation AS wr
ON cr.pid_from = wr.pid
WHERE wr.word = 2
) AS dt
JOIN paper_word_relation AS wr
ON dt.pid_to = wr.pid -- 1:n join again
WHERE wr.word = 2


If you want to count the number of papers which have been cited you need to get a distinct list of pid (either pid_from or pid_to) from paper_citation_relation first and then join to the specific word.



SELECT Count(*)
FROM
( -- get a unique list of cited or citing papers
SELECT pid_from AS pid -- citing
FROM paper_citation_relation
UNION -- DISTINCT by default
SELECT pid_to -- cited
FROM paper_citation_relation
) AS dt
JOIN paper_word_relation AS wr
ON wr.pid = dt.pid
WHERE wr.word = 2 -- now check for the searched word


The number returned by this might be slightly higher (it counts a paper regardless if cited or citing).







share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 20 '18 at 10:56

























answered Nov 20 '18 at 10:01









dnoethdnoeth

45.5k31839




45.5k31839













  • Thanks, but your query result isn't consistent with @MadhurBhaiya's result, I know very little about SQL query optimization so I don't know why...

    – TomLeung
    Nov 20 '18 at 10:18











  • Are the number totally different or quite similar? You should find out, what result you actually want and double check if any of the queries return that. E.g. check for differences when you do COUNT(DISTINCT a.pid_TO) in @MadhurBhaiya's query.

    – dnoeth
    Nov 20 '18 at 10:23













  • Quite similar. I need that DISTINCT, because if a paper with word W cite many paper with W, it should only count once

    – TomLeung
    Nov 20 '18 at 10:42











  • I added another variation, which might return slightly different number due to different logic.

    – dnoeth
    Nov 20 '18 at 10:57



















  • Thanks, but your query result isn't consistent with @MadhurBhaiya's result, I know very little about SQL query optimization so I don't know why...

    – TomLeung
    Nov 20 '18 at 10:18











  • Are the number totally different or quite similar? You should find out, what result you actually want and double check if any of the queries return that. E.g. check for differences when you do COUNT(DISTINCT a.pid_TO) in @MadhurBhaiya's query.

    – dnoeth
    Nov 20 '18 at 10:23













  • Quite similar. I need that DISTINCT, because if a paper with word W cite many paper with W, it should only count once

    – TomLeung
    Nov 20 '18 at 10:42











  • I added another variation, which might return slightly different number due to different logic.

    – dnoeth
    Nov 20 '18 at 10:57

















Thanks, but your query result isn't consistent with @MadhurBhaiya's result, I know very little about SQL query optimization so I don't know why...

– TomLeung
Nov 20 '18 at 10:18





Thanks, but your query result isn't consistent with @MadhurBhaiya's result, I know very little about SQL query optimization so I don't know why...

– TomLeung
Nov 20 '18 at 10:18













Are the number totally different or quite similar? You should find out, what result you actually want and double check if any of the queries return that. E.g. check for differences when you do COUNT(DISTINCT a.pid_TO) in @MadhurBhaiya's query.

– dnoeth
Nov 20 '18 at 10:23







Are the number totally different or quite similar? You should find out, what result you actually want and double check if any of the queries return that. E.g. check for differences when you do COUNT(DISTINCT a.pid_TO) in @MadhurBhaiya's query.

– dnoeth
Nov 20 '18 at 10:23















Quite similar. I need that DISTINCT, because if a paper with word W cite many paper with W, it should only count once

– TomLeung
Nov 20 '18 at 10:42





Quite similar. I need that DISTINCT, because if a paper with word W cite many paper with W, it should only count once

– TomLeung
Nov 20 '18 at 10:42













I added another variation, which might return slightly different number due to different logic.

– dnoeth
Nov 20 '18 at 10:57





I added another variation, which might return slightly different number due to different logic.

– dnoeth
Nov 20 '18 at 10:57


















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53387347%2fhow-can-i-count-rows-in-a-1nn-relation-in-a-faster-way%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







這個網誌中的熱門文章

Tangent Lines Diagram Along Smooth Curve

Yusuf al-Mu'taman ibn Hud

Zucchini