Random access to the specified amount of data, is there any better way?
MySQL: fetching random rows
Scenario
I need to fetch a specified number of random rows from the database, and this turns out to be surprisingly troublesome.
Suppose we have the following table:
```sql
create table topic (
  id      int primary key not null
  comment 'number',
  content varchar(20)     not null
  comment 'content'
)
  comment 'topic table';
```
The `topic` table has two key properties:
- The primary key is comparable (`int`)
- The primary key follows an overall trend (auto-increment or auto-decrement)
Solution 1: Use `order by rand()` directly
`order by rand()` returns random rows directly, and it can return every row in the table (still in random order).
- Sort by the result of `rand()`
  > This step is equivalent to adding a column of `rand()` values to every row and then sorting on that column
- Limit the number of rows returned
```sql
select *
from topic
order by rand()
limit 50000;
```
The drawback is obvious: speed. The `rand()` values are not indexed, so the sort is very slow.
Fetching 50,000 random rows from a table of 100,000 takes 6 s 378 ms, which is far too long.
`order by rand()` may look odd, but it is actually equivalent to:
```sql
select *
from (
       select
         topic.*,
         rand() as order_column
       from topic
     ) as temp
order by order_column
limit 50000;
```
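To make the equivalence concrete, here is a minimal Python sketch of what the rewritten query does: attach a random key to every row, sort on that key, and keep the first `limit` rows. The function name `order_by_rand` and the in-memory list standing in for the `topic` table are illustrative assumptions, not anything MySQL provides.

```python
import random


def order_by_rand(rows, limit, rng=None):
    """Mimic `select * from topic order by rand() limit N` on an in-memory list."""
    rng = rng or random.Random()
    # Step 1: give every row its own random sort key (the order_column).
    decorated = [(rng.random(), row) for row in rows]
    # Step 2: sort the whole table on that key. No index can help here,
    # because the keys only exist for the duration of this call.
    decorated.sort(key=lambda pair: pair[0])
    # Step 3: keep only the first `limit` rows.
    return [row for _, row in decorated[:limit]]


if __name__ == "__main__":
    topic_ids = list(range(1, 11))  # hypothetical ids 1..10
    print(order_by_rand(topic_ids, 5))
```

The full sort in step 2 is the expensive part: its cost depends on the size of the table, not on `limit`.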
Solution 2: Use `where` with a random value in the middle
Since sorting on unindexed `rand()` values is too slow, we can try to sidestep the sort altogether.
The approach is:
- Take a random value between the minimum and maximum id
- Check whether each id is greater than (or less than) this random value
- Limit the number of rows returned
```sql
select *
from topic
where id >= ((select max(id)
              from topic)
             - (select min(id)
                from topic))
            * rand()
            + (select min(id)
               from topic)
limit 50000;
```
This method is extremely fast (150 ms), but it is affected by the density of the data. If the ids are not evenly distributed, the number of rows the query can return is limited.
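The density effect can be reproduced without a database. The sketch below simulates the `where` filter in Python, assuming (as the MySQL manual describes for `rand()` in a `where` clause) that a fresh `rand()` value is drawn for every row, so each row is compared against its own random threshold. The function name `where_rand_filter` and both id lists are made up for illustration.

```python
import random


def where_rand_filter(ids, limit, rng=None):
    """Simulate `where id >= (max - min) * rand() + min limit N`.

    Assumption: a fresh rand() value is drawn for every row,
    and `ids` is already in primary-key order.
    """
    rng = rng or random.Random()
    lo, hi = min(ids), max(ids)
    picked = []
    for row_id in ids:  # rows are scanned from the first id onwards
        threshold = (hi - lo) * rng.random() + lo
        if row_id >= threshold:
            picked.append(row_id)
            if len(picked) == limit:
                break
    return picked


if __name__ == "__main__":
    rng = random.Random(42)
    dense = list(range(1, 100_001))                # gapless ids
    sparse = list(range(1, 1_001)) + [1_000_000]   # one far-away id stretches the range
    print(len(where_rand_filter(dense, 100_000, rng)))
    print(len(where_rand_filter(sparse, 100_000, rng)))
```

With gapless ids roughly half of the rows pass the filter. With the stretched range almost none do, no matter how large `limit` is.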
So the method has the following defects.
The rows returned depend on the distribution density
For example, suppose the ids are distributed as follows:
1,100002,100003,100004...199999,200000
Then the query above returns only a small number of rows (about 25,000). However, if you change the operator from `>=` to `<=`, the average number of rows returned increases greatly (to about 75,000).
```sql
select *
from topic
-- Note: the operator here has been changed to <=.
where id <= ((select max(id)
              from topic)
             - (select min(id)
                from topic))
            * rand()
            + (select min(id)
               from topic)
limit 50000;
```
Not every row has the same probability
Although the rows returned are random, they are not all equally likely. For example, with `<=` the first row always shows up. Its probability of being selected is too high, because the table is read starting from the first row. Even after switching to `>=`, the first row returned generally has a small id.
With `>=`:
- The nearer a row is to the front, the lower its probability of being selected
- But even at very low probability, some row near the front will still be selected, so the first row returned generally has a small id
- When the data density is too large, the number of rows returned is very small
The more even the density, the closer the average number of rows returned is to 1/2 of the table; otherwise it deviates further (it may be either too large or too small).
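The observations for `>=` can be checked with a small calculation. Assuming again that `rand()` is drawn once per row, the threshold for a row is uniform between the minimum and maximum id, so a row passes with probability (id - min) / (max - min). The helper names below are illustrative only, and the model treats `rand()` as a continuous uniform value.

```python
from fractions import Fraction


def pass_probability(row_id, lo, hi):
    """P(row_id >= (hi - lo) * U + lo) for U uniform on [0, 1); requires hi > lo."""
    return Fraction(row_id - lo, hi - lo)


def expected_fraction(ids):
    """Expected share of rows that pass the `>=` filter (before any limit)."""
    lo, hi = min(ids), max(ids)
    return sum(pass_probability(i, lo, hi) for i in ids) / len(ids)


if __name__ == "__main__":
    even = list(range(1, 101))             # gapless ids 1..100
    print(pass_probability(1, 1, 100))     # first row
    print(pass_probability(100, 1, 100))   # last row
    print(expected_fraction(even))         # share of rows expected to pass
    print(expected_fraction([1, 2, 3, 1000]))        # ids clustered at the low end
    print(expected_fraction([1, 998, 999, 1000]))    # ids clustered at the high end
```

For evenly spaced ids the expected share works out to exactly one half. Clustering the ids at either end of the range pushes it below or above one half, which is the "either too large or too small" deviation.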
Solution 3: Use a temporary table
Solution 2 focuses on avoiding the unindexed sort on `rand()`. Another option is to sort on the `rand()` values after indexing them: create a temporary table that contains only the primary key `id` and an indexed column `randomId` to sort on, then read the shuffled ids once the sort is done.
```sql
drop temporary table if exists temp_topic;
create temporary table temp_topic (
  id       bigint primary key not null,
  randomId double             not null,
  index (randomId)
)
  as
    select
      id,
      rand() as randomId
    from topic;

select t.*
from topic t
  join (
         select id
         from (
                select id
                from temp_topic
                order by randomId
              ) as temp
         limit 50000
       ) as temp
    on t.id = temp.id;
```
This method is not especially fast (878 ms, slower than Solution 2), and its cost still grows with the amount of data, because the ids have to be copied. But, like Solution 1, it is a truly random selection.
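For readers who want to try the temporary-table idea end to end without a MySQL server, here is a rough sketch using Python's built-in `sqlite3` module. It is a stand-in, not the MySQL code above: SQLite's `random()` replaces `rand()`, the index is created in a separate statement, and the helper names `sample_with_temp_table` and `make_demo_db` plus the demo rows are invented for illustration.

```python
import sqlite3


def sample_with_temp_table(conn, n):
    """Pick n random rows from `topic` via an indexed temporary table.

    SQLite stand-in for the MySQL statements: random() replaces rand(),
    and the index is created in a separate statement.
    """
    cur = conn.cursor()
    cur.execute("drop table if exists temp_topic")
    # Copy only the ids, each paired with a random sort key.
    cur.execute(
        "create temporary table temp_topic as "
        "select id, random() as randomId from topic"
    )
    # Index the random key so the ordered read below can use it.
    cur.execute("create index idx_temp_topic_random on temp_topic (randomId)")
    cur.execute(
        """
        select t.id, t.content
        from topic t
        join (
            select id from temp_topic order by randomId limit ?
        ) as picked on t.id = picked.id
        """,
        (n,),
    )
    return cur.fetchall()


def make_demo_db(row_count):
    """Build an in-memory `topic` table with hypothetical demo rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table topic (id integer primary key, content text not null)"
    )
    conn.executemany(
        "insert into topic (id, content) values (?, ?)",
        [(i, f"topic {i}") for i in range(1, row_count + 1)],
    )
    return conn


if __name__ == "__main__":
    print(sample_with_temp_table(make_demo_db(1000), 5))
```

Every id gets an independent random key, so each row is equally likely to land in the first `n`, which is why this approach does not depend on how the ids are distributed.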
Summary
Here is a good English article that analyzes random row selection: http://jan.kneschke.de/projects/mysql/order-by-rand/. Some of its techniques do not work here, and I don't know why.
| Differences                                  | `order by rand()` | `where`           | `temporary` |
| -------------------------------------------- | ----------------- | ----------------- | ----------- |
| Can return every row at random               | Yes               | Almost impossible | Yes         |
| Speed                                        | Slow              | Very fast         | Fairly fast |
| Needs a comparable primary key type          | No                | Yes               | No          |
| Affected by data distribution density        | No                | Yes               | No          |
| Speed is affected by table data complexity   | Very large        | Very small        | Small       |
mysql
Your result will only be as you are expecting if your ids are gapless and sequential (or equidistant) (if your ids are e.g. 1, 100000..100998, you will get count = 999 for >99,9% of your runs). Check for that (e.g. if "min + 1000 = max - 1"). Also, a warning: your two approaches are completely different. The 2nd query ("before") will give you a random subset of your users, while the 1st code will just specify where you start the list, e.g. it will in 100% of cases include the max-id user and basically never the min-id user. I am not sure if that new behaviour is what you intended.
– Solarflare
Nov 10 at 9:48
@Solarflare Yes, I already know the reason for this problem is the uneven distribution of data density. In addition, both methods are for random access to data. But both methods are flawed, is there a better way?
– rxliuli
Nov 10 at 14:49
If you already know the reason for the uneven distribution, I am not entirely sure what your question is about, it sounded like you wanted to know why you don't get values < 500. If not, you need to clarify that. And again: both codes do completely different things. Asking for a better way (to do what!?) is like asking "Oranges and apples are flawed. Is there a better fruit?" The answer might be different if you want to make an apple pie, orange juice or banana bread. So you would need to describe what you want to do (exactly) in order for us to suggest something (different).
– Solarflare
Nov 10 at 16:14
Does it work any faster if you do `select * from topic where id in (select id from topic order by rand() limit 1000)`? BTW you're selecting 50% of rows... is that correct?
– Salman A
Nov 12 at 14:01
First of all, the SQL `select * from topic where id in (select id from topic order by rand() limit 1000)` can't run, and `order by rand()` is painfully slow! Also, I don't want only 50% of the data, but I still can't find a particularly good solution...
– rxliuli
Nov 13 at 19:45
edited Nov 13 at 19:42
asked Nov 10 at 2:32
rxliuli
Your result will only be as you are expecting if your ids are gapless and sequential (or equidistant) (if your ids are e.g. 1, 100000..100998, you will get count = 999 for >99,9% of your runs). Check for that (e.g. if "min + 1000 = max - 1"). Also, a warning: your two approaches are completely different. The 2nd query ("before") will give you a random subset of your users, while the 1st code will just specify where you start the list, e.g. it will in 100% of cases include the max-id user and basically never the min-id user. I am not sure if that new behaviour is what you intended.
– Solarflare
Nov 10 at 9:48
@Solarflare Yes, I already know the reason for this problem is the uneven distribution of data density. In addition, both methods are for random access to data. But both methods are flawed, is there a better way?
– rxliuli
Nov 10 at 14:49
If you already know the reason for the uneven distribution, I am not entirely sure what your question is about, it sounded like you wanted to know why you don't get values < 500. If not, you need to clarify that. And again: both codes do completely different things. Asking for a better way (to do what!?) is like asking "Oranges and apples are flawed. Is there a better fruit?" The answer might be different if you want to make an apple pie, orange juice or banana bread. So you would need to describe what you want to do (exactly) in order for us to suggest something (different).
– Solarflare
Nov 10 at 16:14
Does it work any faster if you doselect * from topic where id in (select id from topic order by rand() limit 1000)
? BTW you're selecting 50% of rows... is that correct?
– Salman A
Nov 12 at 14:01
First of all, sqlselect * from topic where id in (select id from topic order by rand() limit 1000)
can't run, and the speed of usingorder by rand()
will be very slow, slow to doubt life! Then, I don't want to have only 50% of the data, but I still can't find a particularly good solution. . .
– rxliuli
Nov 13 at 19:45
|
show 4 more comments
Your result will only be as you are expecting if your ids are gapless and sequential (or equidistant) (if your ids are e.g. 1, 100000..100998, you will get count = 999 for >99,9% of your runs). Check for that (e.g. if "min + 1000 = max - 1"). Also, a warning: your two approaches are completely different. The 2nd query ("before") will give you a random subset of your users, while the 1st code will just specify where you start the list, e.g. it will in 100% of cases include the max-id user and basically never the min-id user. I am not sure if that new behaviour is what you intended.
– Solarflare
Nov 10 at 9:48
@Solarflare Yes, I already know the reason for this problem is the uneven distribution of data density. In addition, both methods are for random access to data. But both methods are flawed, is there a better way?
– rxliuli
Nov 10 at 14:49
If you already know the reason for the uneven distribution, I am not entirely sure what your question is about, it sounded like you wanted to know why you don't get values < 500. If not, you need to clarify that. And again: both codes do completely different things. Asking for a better way (to do what!?) is like asking "Oranges and apples are flawed. Is there a better fruit?" The answer might be different if you want to make an apple pie, orange juice or banana bread. So you would need to describe what you want to do (exactly) in order for us to suggest something (different).
– Solarflare
Nov 10 at 16:14
Does it work any faster if you doselect * from topic where id in (select id from topic order by rand() limit 1000)
? BTW you're selecting 50% of rows... is that correct?
– Salman A
Nov 12 at 14:01
First of all, sqlselect * from topic where id in (select id from topic order by rand() limit 1000)
can't run, and the speed of usingorder by rand()
will be very slow, slow to doubt life! Then, I don't want to have only 50% of the data, but I still can't find a particularly good solution. . .
– rxliuli
Nov 13 at 19:45
Your result will only be as you are expecting if your ids are gapless and sequential (or equidistant) (if your ids are e.g. 1, 100000..100998, you will get count = 999 for >99,9% of your runs). Check for that (e.g. if "min + 1000 = max - 1"). Also, a warning: your two approaches are completely different. The 2nd query ("before") will give you a random subset of your users, while the 1st code will just specify where you start the list, e.g. it will in 100% of cases include the max-id user and basically never the min-id user. I am not sure if that new behaviour is what you intended.
– Solarflare
Nov 10 at 9:48
Your result will only be as you are expecting if your ids are gapless and sequential (or equidistant) (if your ids are e.g. 1, 100000..100998, you will get count = 999 for >99,9% of your runs). Check for that (e.g. if "min + 1000 = max - 1"). Also, a warning: your two approaches are completely different. The 2nd query ("before") will give you a random subset of your users, while the 1st code will just specify where you start the list, e.g. it will in 100% of cases include the max-id user and basically never the min-id user. I am not sure if that new behaviour is what you intended.
– Solarflare
Nov 10 at 9:48
@Solarflare Yes, I already know the reason for this problem is the uneven distribution of data density. In addition, both methods are for random access to data. But both methods are flawed, is there a better way?
– rxliuli
Nov 10 at 14:49
@Solarflare Yes, I already know the reason for this problem is the uneven distribution of data density. In addition, both methods are for random access to data. But both methods are flawed, is there a better way?
– rxliuli
Nov 10 at 14:49
If you already know the reason for the uneven distribution, I am not entirely sure what your question is about, it sounded like you wanted to know why you don't get values < 500. If not, you need to clarify that. And again: both codes do completely different things. Asking for a better way (to do what!?) is like asking "Oranges and apples are flawed. Is there a better fruit?" The answer might be different if you want to make an apple pie, orange juice or banana bread. So you would need to describe what you want to do (exactly) in order for us to suggest something (different).
– Solarflare
Nov 10 at 16:14
If you already know the reason for the uneven distribution, I am not entirely sure what your question is about, it sounded like you wanted to know why you don't get values < 500. If not, you need to clarify that. And again: both codes do completely different things. Asking for a better way (to do what!?) is like asking "Oranges and apples are flawed. Is there a better fruit?" The answer might be different if you want to make an apple pie, orange juice or banana bread. So you would need to describe what you want to do (exactly) in order for us to suggest something (different).
– Solarflare
Nov 10 at 16:14
Does it work any faster if you do
select * from topic where id in (select id from topic order by rand() limit 1000)
? BTW you're selecting 50% of rows... is that correct?– Salman A
Nov 12 at 14:01
Does it work any faster if you do
select * from topic where id in (select id from topic order by rand() limit 1000)
? BTW you're selecting 50% of rows... is that correct?– Salman A
Nov 12 at 14:01
First of all, sql
select * from topic where id in (select id from topic order by rand() limit 1000)
can't run, and the speed of using order by rand()
will be very slow, slow to doubt life! Then, I don't want to have only 50% of the data, but I still can't find a particularly good solution. . .– rxliuli
Nov 13 at 19:45
First of all, sql
select * from topic where id in (select id from topic order by rand() limit 1000)
can't run, and the speed of using order by rand()
will be very slow, slow to doubt life! Then, I don't want to have only 50% of the data, but I still can't find a particularly good solution. . .– rxliuli
Nov 13 at 19:45
Your result will only be what you expect if your ids are gapless and sequential (or equidistant). If your ids are e.g. 1, 100000..100998, you will get count = 999 for >99.9% of your runs. Check for that (e.g. if "min + 1000 = max - 1"). Also, a warning: your two approaches are completely different. The 2nd query ("before") will give you a random subset of your users, while the 1st code will just specify where you start the list, e.g. it will in 100% of cases include the max-id user and basically never the min-id user. I am not sure if that new behaviour is what you intended.
– Solarflare
Nov 10 at 9:48
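One way to run the gap check suggested above, as a rough sketch that assumes the topic table from the question and relies on id being a unique primary key. The column alias is_gapless is made up for illustration.

```sql
-- Sketch only: 1 when ids are gapless and sequential, 0 when there are gaps
-- (NULL for an empty table, because max(id) and min(id) are NULL).
-- Because id is unique, the row count equals the id span only if no id is missing.
select count(*) = max(id) - min(id) + 1 as is_gapless
from topic;
```

If this returns 0, a threshold drawn uniformly between min(id) and max(id) will not select rows uniformly, which is the skew described in the comment.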
@Solarflare Yes, I already know that the cause of this problem is the uneven density of the data. Both methods are meant for random access to the data, but both are flawed. Is there a better way?
– rxliuli
Nov 10 at 14:49
If you already know that the cause is the uneven distribution, I am not entirely sure what your question is about; it sounded like you wanted to know why you don't get values < 500. If not, you need to clarify that. And again: both codes do completely different things. Asking for a better way (to do what!?) is like asking "Oranges and apples are flawed. Is there a better fruit?" The answer might be different if you want to make an apple pie, orange juice or banana bread. So you would need to describe (exactly) what you want to do in order for us to suggest something (different).
– Solarflare
Nov 10 at 16:14