How can I count rows in a 1:N:N relation in a faster way?
This question is a bit complicated to me, and I can't explain it in one sentence so the title may seem quite ambiguous.
I have 3 tables in my MySQL database, their structure is shown below:
- word_list (5 million rows)
+-----+--------+
| wid | word |
+-----+--------+
| 1 | foo |
| 2 | bar |
| 3 | hello |
+-----+--------+
- paper_word_relation (10 million rows)
+-----+-------+
| pid | word |
+-----+-------+
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 1 |
| 2 | 3 |
+-----+-------+
- paper_citation_relation (80K rows)
+----------+--------+
| pid_from | pid_to |
+----------+--------+
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 2 | 1 |
| 2 | 3 |
+----------+--------+
I want to find out how many papers contain word W, and cite the papers also contain word W.(for each word in the list)
I use two inner join to do this job but it seems extremely slow when the word is popular - above 50s (quite fast if the word is rarely used - below 0.1s), here is my code
SELECT COUNT(*) FROM (
SELECT a.pid_from, a.pid_to, b.word FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid
INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid
WHERE b.word = 2 AND c.word = 2) AS d
How can I do this faster? Is my query not efficient enough or it's the problem about the amount of data?
I can only come up with one solution that I delete the words which occur less than 2 in the paper_word_relation
table. (About 4 million words only occur once)
Thanks!
mysql sql
add a comment |
This question is a bit complicated to me, and I can't explain it in one sentence so the title may seem quite ambiguous.
I have 3 tables in my MySQL database, their structure is shown below:
- word_list (5 million rows)
+-----+--------+
| wid | word |
+-----+--------+
| 1 | foo |
| 2 | bar |
| 3 | hello |
+-----+--------+
- paper_word_relation (10 million rows)
+-----+-------+
| pid | word |
+-----+-------+
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 1 |
| 2 | 3 |
+-----+-------+
- paper_citation_relation (80K rows)
+----------+--------+
| pid_from | pid_to |
+----------+--------+
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 2 | 1 |
| 2 | 3 |
+----------+--------+
I want to find out how many papers contain word W, and cite the papers also contain word W.(for each word in the list)
I use two inner join to do this job but it seems extremely slow when the word is popular - above 50s (quite fast if the word is rarely used - below 0.1s), here is my code
SELECT COUNT(*) FROM (
SELECT a.pid_from, a.pid_to, b.word FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid
INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid
WHERE b.word = 2 AND c.word = 2) AS d
How can I do this faster? Is my query not efficient enough or it's the problem about the amount of data?
I can only come up with one solution that I delete the words which occur less than 2 in the paper_word_relation
table. (About 4 million words only occur once)
Thanks!
mysql sql
Which version of MySQL? What indexes do you have on your tables?
– oysteing
Nov 20 '18 at 23:07
add a comment |
This question is a bit complicated to me, and I can't explain it in one sentence so the title may seem quite ambiguous.
I have 3 tables in my MySQL database, their structure is shown below:
- word_list (5 million rows)
+-----+--------+
| wid | word |
+-----+--------+
| 1 | foo |
| 2 | bar |
| 3 | hello |
+-----+--------+
- paper_word_relation (10 million rows)
+-----+-------+
| pid | word |
+-----+-------+
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 1 |
| 2 | 3 |
+-----+-------+
- paper_citation_relation (80K rows)
+----------+--------+
| pid_from | pid_to |
+----------+--------+
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 2 | 1 |
| 2 | 3 |
+----------+--------+
I want to find out how many papers contain word W, and cite the papers also contain word W.(for each word in the list)
I use two inner join to do this job but it seems extremely slow when the word is popular - above 50s (quite fast if the word is rarely used - below 0.1s), here is my code
SELECT COUNT(*) FROM (
SELECT a.pid_from, a.pid_to, b.word FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid
INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid
WHERE b.word = 2 AND c.word = 2) AS d
How can I do this faster? Is my query not efficient enough or it's the problem about the amount of data?
I can only come up with one solution that I delete the words which occur less than 2 in the paper_word_relation
table. (About 4 million words only occur once)
Thanks!
mysql sql
This question is a bit complicated to me, and I can't explain it in one sentence so the title may seem quite ambiguous.
I have 3 tables in my MySQL database, their structure is shown below:
- word_list (5 million rows)
+-----+--------+
| wid | word |
+-----+--------+
| 1 | foo |
| 2 | bar |
| 3 | hello |
+-----+--------+
- paper_word_relation (10 million rows)
+-----+-------+
| pid | word |
+-----+-------+
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 1 |
| 2 | 3 |
+-----+-------+
- paper_citation_relation (80K rows)
+----------+--------+
| pid_from | pid_to |
+----------+--------+
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 2 | 1 |
| 2 | 3 |
+----------+--------+
I want to find out how many papers contain word W, and cite the papers also contain word W.(for each word in the list)
I use two inner join to do this job but it seems extremely slow when the word is popular - above 50s (quite fast if the word is rarely used - below 0.1s), here is my code
SELECT COUNT(*) FROM (
SELECT a.pid_from, a.pid_to, b.word FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid
INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid
WHERE b.word = 2 AND c.word = 2) AS d
How can I do this faster? Is my query not efficient enough or it's the problem about the amount of data?
I can only come up with one solution that I delete the words which occur less than 2 in the paper_word_relation
table. (About 4 million words only occur once)
Thanks!
mysql sql
mysql sql
asked Nov 20 '18 at 6:23
TomLeungTomLeung
387
387
Which version of MySQL? What indexes do you have on your tables?
– oysteing
Nov 20 '18 at 23:07
add a comment |
Which version of MySQL? What indexes do you have on your tables?
– oysteing
Nov 20 '18 at 23:07
Which version of MySQL? What indexes do you have on your tables?
– oysteing
Nov 20 '18 at 23:07
Which version of MySQL? What indexes do you have on your tables?
– oysteing
Nov 20 '18 at 23:07
add a comment |
2 Answers
2
active
oldest
votes
If you are only concerned with getting the Count, you should not be first getting the results into a Derived Table, and then Count the rows out. This may create unnecessary temporary tables storing lots of data in-memory. You can directly count the number of rows.
I also think that you need to count unique number of papers. Because of Many-to-Many relationships in paper_citation_relation
table, duplicate rows may be coming for a single paper.
SELECT COUNT(DISTINCT a.pid_from)
FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid
INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid
WHERE b.word = 2 AND c.word = 2
For performance, you will need following indexing:
- Composite Index on
(pid_from, pid_to)
in thepaper_citation_relation
table. - Composite Index on
(pid, word)
in thepaper_word_relation
table.
We may also possibly optimize the query further by reducing one join and use conditional AND/OR
based filtering in HAVING
. You will need to benchmark it though.
SELECT COUNT(*)
FROM (
SELECT a.pid_from
FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b
ON (a.pid_from = b.pid OR
a.pid_to = b.pid)
GROUP BY a.pid_from
HAVING SUM(a.pid_from = b.pid AND b.word = 2) AND
SUM(a.pid_to = b.pid AND b.word = 2)
)
Thanks! But this modification does not improve much, still cost 65s. (66s originally). And the count is 9465 in this case. Is it the best time I can get with this amount of data?
– TomLeung
Nov 20 '18 at 6:32
1
@TomLeung what was the original count value ? Check my updated answer for why I think your count logic needs to be improved. Moreover, is there any indexing done on all the tables ?
– Madhur Bhaiya
Nov 20 '18 at 6:34
Yes, I do need the DISTINCT logic (4333 after distinct). All column is indexed using BTREE NORMAL index.
– TomLeung
Nov 20 '18 at 6:38
@TomLeung can you add the composite indexing as specified by me, and test.
– Madhur Bhaiya
Nov 20 '18 at 6:40
1
It seems this solution is even worse than the original one with composite index, anyway you really help me a lot! Thanks.
– TomLeung
Nov 20 '18 at 7:00
|
show 4 more comments
After the first 1:n
join you get the same pid_to
multiple times and your next join is no longer 1:n
but n:m
, creating a possibly huge intermediate result before the final DISTINCT
. It's similar to a CROSS JOIN and it's getting worse for popular words, e.g. 10*10 vs. 1000*1000 rows.
You must remove the duplicates before the join, this should return the same number as @MadhurBhaiya's answer
SELECT Count(*) -- no more DISTINCT needed
FROM
(
SELECT DISTINCT cr.pid_to -- reducing m to 1
FROM paper_citation_relation AS cr
JOIN paper_word_relation AS wr
ON cr.pid_from = wr.pid
WHERE wr.word = 2
) AS dt
JOIN paper_word_relation AS wr
ON dt.pid_to = wr.pid -- 1:n join again
WHERE wr.word = 2
If you want to count the number of papers which have been cited you need to get a distinct list of pid
(either pid_from
or pid_to
) from paper_citation_relation
first and then join to the specific word.
SELECT Count(*)
FROM
( -- get a unique list of cited or citing papers
SELECT pid_from AS pid -- citing
FROM paper_citation_relation
UNION -- DISTINCT by default
SELECT pid_to -- cited
FROM paper_citation_relation
) AS dt
JOIN paper_word_relation AS wr
ON wr.pid = dt.pid
WHERE wr.word = 2 -- now check for the searched word
The number returned by this might be slightly higher (it counts a paper regardless if cited or citing).
Thanks, but your query result isn't consistent with @MadhurBhaiya's result, I know very little about SQL query optimization so I don't know why...
– TomLeung
Nov 20 '18 at 10:18
Are the number totally different or quite similar? You should find out, what result you actually want and double check if any of the queries return that. E.g. check for differences when you doCOUNT(DISTINCT a.pid_TO)
in @MadhurBhaiya's query.
– dnoeth
Nov 20 '18 at 10:23
Quite similar. I need that DISTINCT, because if a paper with word W cite many paper with W, it should only count once
– TomLeung
Nov 20 '18 at 10:42
I added another variation, which might return slightly different number due to different logic.
– dnoeth
Nov 20 '18 at 10:57
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53387347%2fhow-can-i-count-rows-in-a-1nn-relation-in-a-faster-way%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
If you are only concerned with getting the Count, you should not be first getting the results into a Derived Table, and then Count the rows out. This may create unnecessary temporary tables storing lots of data in-memory. You can directly count the number of rows.
I also think that you need to count unique number of papers. Because of Many-to-Many relationships in paper_citation_relation
table, duplicate rows may be coming for a single paper.
SELECT COUNT(DISTINCT a.pid_from)
FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid
INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid
WHERE b.word = 2 AND c.word = 2
For performance, you will need following indexing:
- Composite Index on
(pid_from, pid_to)
in thepaper_citation_relation
table. - Composite Index on
(pid, word)
in thepaper_word_relation
table.
We may also possibly optimize the query further by reducing one join and use conditional AND/OR
based filtering in HAVING
. You will need to benchmark it though.
SELECT COUNT(*)
FROM (
SELECT a.pid_from
FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b
ON (a.pid_from = b.pid OR
a.pid_to = b.pid)
GROUP BY a.pid_from
HAVING SUM(a.pid_from = b.pid AND b.word = 2) AND
SUM(a.pid_to = b.pid AND b.word = 2)
)
Thanks! But this modification does not improve much, still cost 65s. (66s originally). And the count is 9465 in this case. Is it the best time I can get with this amount of data?
– TomLeung
Nov 20 '18 at 6:32
1
@TomLeung what was the original count value ? Check my updated answer for why I think your count logic needs to be improved. Moreover, is there any indexing done on all the tables ?
– Madhur Bhaiya
Nov 20 '18 at 6:34
Yes, I do need the DISTINCT logic (4333 after distinct). All column is indexed using BTREE NORMAL index.
– TomLeung
Nov 20 '18 at 6:38
@TomLeung can you add the composite indexing as specified by me, and test.
– Madhur Bhaiya
Nov 20 '18 at 6:40
1
It seems this solution is even worse than the original one with composite index, anyway you really help me a lot! Thanks.
– TomLeung
Nov 20 '18 at 7:00
|
show 4 more comments
If you are only concerned with getting the Count, you should not be first getting the results into a Derived Table, and then Count the rows out. This may create unnecessary temporary tables storing lots of data in-memory. You can directly count the number of rows.
I also think that you need to count unique number of papers. Because of Many-to-Many relationships in paper_citation_relation
table, duplicate rows may be coming for a single paper.
SELECT COUNT(DISTINCT a.pid_from)
FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid
INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid
WHERE b.word = 2 AND c.word = 2
For performance, you will need following indexing:
- Composite Index on
(pid_from, pid_to)
in thepaper_citation_relation
table. - Composite Index on
(pid, word)
in thepaper_word_relation
table.
We may also possibly optimize the query further by reducing one join and use conditional AND/OR
based filtering in HAVING
. You will need to benchmark it though.
SELECT COUNT(*)
FROM (
SELECT a.pid_from
FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b
ON (a.pid_from = b.pid OR
a.pid_to = b.pid)
GROUP BY a.pid_from
HAVING SUM(a.pid_from = b.pid AND b.word = 2) AND
SUM(a.pid_to = b.pid AND b.word = 2)
)
Thanks! But this modification does not improve much, still cost 65s. (66s originally). And the count is 9465 in this case. Is it the best time I can get with this amount of data?
– TomLeung
Nov 20 '18 at 6:32
1
@TomLeung what was the original count value ? Check my updated answer for why I think your count logic needs to be improved. Moreover, is there any indexing done on all the tables ?
– Madhur Bhaiya
Nov 20 '18 at 6:34
Yes, I do need the DISTINCT logic (4333 after distinct). All column is indexed using BTREE NORMAL index.
– TomLeung
Nov 20 '18 at 6:38
@TomLeung can you add the composite indexing as specified by me, and test.
– Madhur Bhaiya
Nov 20 '18 at 6:40
1
It seems this solution is even worse than the original one with composite index, anyway you really help me a lot! Thanks.
– TomLeung
Nov 20 '18 at 7:00
|
show 4 more comments
If you are only concerned with getting the Count, you should not be first getting the results into a Derived Table, and then Count the rows out. This may create unnecessary temporary tables storing lots of data in-memory. You can directly count the number of rows.
I also think that you need to count unique number of papers. Because of Many-to-Many relationships in paper_citation_relation
table, duplicate rows may be coming for a single paper.
SELECT COUNT(DISTINCT a.pid_from)
FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid
INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid
WHERE b.word = 2 AND c.word = 2
For performance, you will need following indexing:
- Composite Index on
(pid_from, pid_to)
in thepaper_citation_relation
table. - Composite Index on
(pid, word)
in thepaper_word_relation
table.
We may also possibly optimize the query further by reducing one join and use conditional AND/OR
based filtering in HAVING
. You will need to benchmark it though.
SELECT COUNT(*)
FROM (
SELECT a.pid_from
FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b
ON (a.pid_from = b.pid OR
a.pid_to = b.pid)
GROUP BY a.pid_from
HAVING SUM(a.pid_from = b.pid AND b.word = 2) AND
SUM(a.pid_to = b.pid AND b.word = 2)
)
If you are only concerned with getting the Count, you should not be first getting the results into a Derived Table, and then Count the rows out. This may create unnecessary temporary tables storing lots of data in-memory. You can directly count the number of rows.
I also think that you need to count unique number of papers. Because of Many-to-Many relationships in paper_citation_relation
table, duplicate rows may be coming for a single paper.
SELECT COUNT(DISTINCT a.pid_from)
FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b ON a.pid_from = b.pid
INNER JOIN paper_word_relation AS c ON a.pid_to = c.pid
WHERE b.word = 2 AND c.word = 2
For performance, you will need following indexing:
- Composite Index on
(pid_from, pid_to)
in thepaper_citation_relation
table. - Composite Index on
(pid, word)
in thepaper_word_relation
table.
We may also possibly optimize the query further by reducing one join and use conditional AND/OR
based filtering in HAVING
. You will need to benchmark it though.
SELECT COUNT(*)
FROM (
SELECT a.pid_from
FROM paper_citation_relation AS a
INNER JOIN paper_word_relation AS b
ON (a.pid_from = b.pid OR
a.pid_to = b.pid)
GROUP BY a.pid_from
HAVING SUM(a.pid_from = b.pid AND b.word = 2) AND
SUM(a.pid_to = b.pid AND b.word = 2)
)
edited Nov 20 '18 at 7:04
answered Nov 20 '18 at 6:26
Madhur BhaiyaMadhur Bhaiya
19.6k62236
19.6k62236
Thanks! But this modification does not improve much, still cost 65s. (66s originally). And the count is 9465 in this case. Is it the best time I can get with this amount of data?
– TomLeung
Nov 20 '18 at 6:32
1
@TomLeung what was the original count value ? Check my updated answer for why I think your count logic needs to be improved. Moreover, is there any indexing done on all the tables ?
– Madhur Bhaiya
Nov 20 '18 at 6:34
Yes, I do need the DISTINCT logic (4333 after distinct). All column is indexed using BTREE NORMAL index.
– TomLeung
Nov 20 '18 at 6:38
@TomLeung can you add the composite indexing as specified by me, and test.
– Madhur Bhaiya
Nov 20 '18 at 6:40
1
It seems this solution is even worse than the original one with composite index, anyway you really help me a lot! Thanks.
– TomLeung
Nov 20 '18 at 7:00
|
show 4 more comments
Thanks! But this modification does not improve much, still cost 65s. (66s originally). And the count is 9465 in this case. Is it the best time I can get with this amount of data?
– TomLeung
Nov 20 '18 at 6:32
1
@TomLeung what was the original count value ? Check my updated answer for why I think your count logic needs to be improved. Moreover, is there any indexing done on all the tables ?
– Madhur Bhaiya
Nov 20 '18 at 6:34
Yes, I do need the DISTINCT logic (4333 after distinct). All column is indexed using BTREE NORMAL index.
– TomLeung
Nov 20 '18 at 6:38
@TomLeung can you add the composite indexing as specified by me, and test.
– Madhur Bhaiya
Nov 20 '18 at 6:40
1
It seems this solution is even worse than the original one with composite index, anyway you really help me a lot! Thanks.
– TomLeung
Nov 20 '18 at 7:00
Thanks! But this modification does not improve much, still cost 65s. (66s originally). And the count is 9465 in this case. Is it the best time I can get with this amount of data?
– TomLeung
Nov 20 '18 at 6:32
Thanks! But this modification does not improve much, still cost 65s. (66s originally). And the count is 9465 in this case. Is it the best time I can get with this amount of data?
– TomLeung
Nov 20 '18 at 6:32
1
1
@TomLeung what was the original count value ? Check my updated answer for why I think your count logic needs to be improved. Moreover, is there any indexing done on all the tables ?
– Madhur Bhaiya
Nov 20 '18 at 6:34
@TomLeung what was the original count value ? Check my updated answer for why I think your count logic needs to be improved. Moreover, is there any indexing done on all the tables ?
– Madhur Bhaiya
Nov 20 '18 at 6:34
Yes, I do need the DISTINCT logic (4333 after distinct). All column is indexed using BTREE NORMAL index.
– TomLeung
Nov 20 '18 at 6:38
Yes, I do need the DISTINCT logic (4333 after distinct). All column is indexed using BTREE NORMAL index.
– TomLeung
Nov 20 '18 at 6:38
@TomLeung can you add the composite indexing as specified by me, and test.
– Madhur Bhaiya
Nov 20 '18 at 6:40
@TomLeung can you add the composite indexing as specified by me, and test.
– Madhur Bhaiya
Nov 20 '18 at 6:40
1
1
It seems this solution is even worse than the original one with composite index, anyway you really help me a lot! Thanks.
– TomLeung
Nov 20 '18 at 7:00
It seems this solution is even worse than the original one with composite index, anyway you really help me a lot! Thanks.
– TomLeung
Nov 20 '18 at 7:00
|
show 4 more comments
After the first 1:n
join you get the same pid_to
multiple times and your next join is no longer 1:n
but n:m
, creating a possibly huge intermediate result before the final DISTINCT
. It's similar to a CROSS JOIN and it's getting worse for popular words, e.g. 10*10 vs. 1000*1000 rows.
You must remove the duplicates before the join, this should return the same number as @MadhurBhaiya's answer
SELECT Count(*) -- no more DISTINCT needed
FROM
(
SELECT DISTINCT cr.pid_to -- reducing m to 1
FROM paper_citation_relation AS cr
JOIN paper_word_relation AS wr
ON cr.pid_from = wr.pid
WHERE wr.word = 2
) AS dt
JOIN paper_word_relation AS wr
ON dt.pid_to = wr.pid -- 1:n join again
WHERE wr.word = 2
If you want to count the number of papers which have been cited you need to get a distinct list of pid
(either pid_from
or pid_to
) from paper_citation_relation
first and then join to the specific word.
SELECT Count(*)
FROM
( -- get a unique list of cited or citing papers
SELECT pid_from AS pid -- citing
FROM paper_citation_relation
UNION -- DISTINCT by default
SELECT pid_to -- cited
FROM paper_citation_relation
) AS dt
JOIN paper_word_relation AS wr
ON wr.pid = dt.pid
WHERE wr.word = 2 -- now check for the searched word
The number returned by this might be slightly higher (it counts a paper regardless if cited or citing).
Thanks, but your query result isn't consistent with @MadhurBhaiya's result, I know very little about SQL query optimization so I don't know why...
– TomLeung
Nov 20 '18 at 10:18
Are the number totally different or quite similar? You should find out, what result you actually want and double check if any of the queries return that. E.g. check for differences when you doCOUNT(DISTINCT a.pid_TO)
in @MadhurBhaiya's query.
– dnoeth
Nov 20 '18 at 10:23
Quite similar. I need that DISTINCT, because if a paper with word W cite many paper with W, it should only count once
– TomLeung
Nov 20 '18 at 10:42
I added another variation, which might return slightly different number due to different logic.
– dnoeth
Nov 20 '18 at 10:57
add a comment |
After the first 1:n
join you get the same pid_to
multiple times and your next join is no longer 1:n
but n:m
, creating a possibly huge intermediate result before the final DISTINCT
. It's similar to a CROSS JOIN and it's getting worse for popular words, e.g. 10*10 vs. 1000*1000 rows.
You must remove the duplicates before the join, this should return the same number as @MadhurBhaiya's answer
SELECT Count(*) -- no more DISTINCT needed
FROM
(
SELECT DISTINCT cr.pid_to -- reducing m to 1
FROM paper_citation_relation AS cr
JOIN paper_word_relation AS wr
ON cr.pid_from = wr.pid
WHERE wr.word = 2
) AS dt
JOIN paper_word_relation AS wr
ON dt.pid_to = wr.pid -- 1:n join again
WHERE wr.word = 2
If you want to count the number of papers which have been cited you need to get a distinct list of pid
(either pid_from
or pid_to
) from paper_citation_relation
first and then join to the specific word.
SELECT Count(*)
FROM
( -- get a unique list of cited or citing papers
SELECT pid_from AS pid -- citing
FROM paper_citation_relation
UNION -- DISTINCT by default
SELECT pid_to -- cited
FROM paper_citation_relation
) AS dt
JOIN paper_word_relation AS wr
ON wr.pid = dt.pid
WHERE wr.word = 2 -- now check for the searched word
The number returned by this might be slightly higher (it counts a paper regardless if cited or citing).
Thanks, but your query result isn't consistent with @MadhurBhaiya's result, I know very little about SQL query optimization so I don't know why...
– TomLeung
Nov 20 '18 at 10:18
Are the number totally different or quite similar? You should find out, what result you actually want and double check if any of the queries return that. E.g. check for differences when you doCOUNT(DISTINCT a.pid_TO)
in @MadhurBhaiya's query.
– dnoeth
Nov 20 '18 at 10:23
Quite similar. I need that DISTINCT, because if a paper with word W cite many paper with W, it should only count once
– TomLeung
Nov 20 '18 at 10:42
I added another variation, which might return slightly different number due to different logic.
– dnoeth
Nov 20 '18 at 10:57
add a comment |
After the first 1:n
join you get the same pid_to
multiple times and your next join is no longer 1:n
but n:m
, creating a possibly huge intermediate result before the final DISTINCT
. It's similar to a CROSS JOIN and it's getting worse for popular words, e.g. 10*10 vs. 1000*1000 rows.
You must remove the duplicates before the join, this should return the same number as @MadhurBhaiya's answer
SELECT Count(*) -- no more DISTINCT needed
FROM
(
SELECT DISTINCT cr.pid_to -- reducing m to 1
FROM paper_citation_relation AS cr
JOIN paper_word_relation AS wr
ON cr.pid_from = wr.pid
WHERE wr.word = 2
) AS dt
JOIN paper_word_relation AS wr
ON dt.pid_to = wr.pid -- 1:n join again
WHERE wr.word = 2
If you want to count the number of papers which have been cited you need to get a distinct list of pid
(either pid_from
or pid_to
) from paper_citation_relation
first and then join to the specific word.
SELECT Count(*)
FROM
( -- get a unique list of cited or citing papers
SELECT pid_from AS pid -- citing
FROM paper_citation_relation
UNION -- DISTINCT by default
SELECT pid_to -- cited
FROM paper_citation_relation
) AS dt
JOIN paper_word_relation AS wr
ON wr.pid = dt.pid
WHERE wr.word = 2 -- now check for the searched word
The number returned by this might be slightly higher (it counts a paper regardless if cited or citing).
After the first 1:n
join you get the same pid_to
multiple times and your next join is no longer 1:n
but n:m
, creating a possibly huge intermediate result before the final DISTINCT
. It's similar to a CROSS JOIN and it's getting worse for popular words, e.g. 10*10 vs. 1000*1000 rows.
You must remove the duplicates before the join, this should return the same number as @MadhurBhaiya's answer
SELECT Count(*) -- no more DISTINCT needed
FROM
(
SELECT DISTINCT cr.pid_to -- reducing m to 1
FROM paper_citation_relation AS cr
JOIN paper_word_relation AS wr
ON cr.pid_from = wr.pid
WHERE wr.word = 2
) AS dt
JOIN paper_word_relation AS wr
ON dt.pid_to = wr.pid -- 1:n join again
WHERE wr.word = 2
If you want to count the number of papers which have been cited you need to get a distinct list of pid
(either pid_from
or pid_to
) from paper_citation_relation
first and then join to the specific word.
SELECT Count(*)
FROM
( -- get a unique list of cited or citing papers
SELECT pid_from AS pid -- citing
FROM paper_citation_relation
UNION -- DISTINCT by default
SELECT pid_to -- cited
FROM paper_citation_relation
) AS dt
JOIN paper_word_relation AS wr
ON wr.pid = dt.pid
WHERE wr.word = 2 -- now check for the searched word
The number returned by this might be slightly higher (it counts a paper regardless if cited or citing).
edited Nov 20 '18 at 10:56
answered Nov 20 '18 at 10:01
dnoethdnoeth
45.5k31839
45.5k31839
Thanks, but your query result isn't consistent with @MadhurBhaiya's result, I know very little about SQL query optimization so I don't know why...
– TomLeung
Nov 20 '18 at 10:18
Are the number totally different or quite similar? You should find out, what result you actually want and double check if any of the queries return that. E.g. check for differences when you doCOUNT(DISTINCT a.pid_TO)
in @MadhurBhaiya's query.
– dnoeth
Nov 20 '18 at 10:23
Quite similar. I need that DISTINCT, because if a paper with word W cite many paper with W, it should only count once
– TomLeung
Nov 20 '18 at 10:42
I added another variation, which might return slightly different number due to different logic.
– dnoeth
Nov 20 '18 at 10:57
add a comment |
Thanks, but your query result isn't consistent with @MadhurBhaiya's result, I know very little about SQL query optimization so I don't know why...
– TomLeung
Nov 20 '18 at 10:18
Are the number totally different or quite similar? You should find out, what result you actually want and double check if any of the queries return that. E.g. check for differences when you doCOUNT(DISTINCT a.pid_TO)
in @MadhurBhaiya's query.
– dnoeth
Nov 20 '18 at 10:23
Quite similar. I need that DISTINCT, because if a paper with word W cite many paper with W, it should only count once
– TomLeung
Nov 20 '18 at 10:42
I added another variation, which might return slightly different number due to different logic.
– dnoeth
Nov 20 '18 at 10:57
Thanks, but your query result isn't consistent with @MadhurBhaiya's result, I know very little about SQL query optimization so I don't know why...
– TomLeung
Nov 20 '18 at 10:18
Thanks, but your query result isn't consistent with @MadhurBhaiya's result, I know very little about SQL query optimization so I don't know why...
– TomLeung
Nov 20 '18 at 10:18
Are the number totally different or quite similar? You should find out, what result you actually want and double check if any of the queries return that. E.g. check for differences when you do
COUNT(DISTINCT a.pid_TO)
in @MadhurBhaiya's query.– dnoeth
Nov 20 '18 at 10:23
Are the number totally different or quite similar? You should find out, what result you actually want and double check if any of the queries return that. E.g. check for differences when you do
COUNT(DISTINCT a.pid_TO)
in @MadhurBhaiya's query.– dnoeth
Nov 20 '18 at 10:23
Quite similar. I need that DISTINCT, because if a paper with word W cite many paper with W, it should only count once
– TomLeung
Nov 20 '18 at 10:42
Quite similar. I need that DISTINCT, because if a paper with word W cite many paper with W, it should only count once
– TomLeung
Nov 20 '18 at 10:42
I added another variation, which might return slightly different number due to different logic.
– dnoeth
Nov 20 '18 at 10:57
I added another variation, which might return slightly different number due to different logic.
– dnoeth
Nov 20 '18 at 10:57
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53387347%2fhow-can-i-count-rows-in-a-1nn-relation-in-a-faster-way%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Which version of MySQL? What indexes do you have on your tables?
– oysteing
Nov 20 '18 at 23:07