Pandas - for loop and find index that has the nearest value
I am trying to loop through a grouped dataframe and, within each group, find the index whose value is nearest to an input argument.
For example, given the dataframe below, for every group defined by global_id I want to keep frames that are spaced at least 10 frames apart. For instance, if I have a list of frames [1,2,3,4,14,20,30,31], the output would be [1,14,30] because:
- I would initialize by taking frame 1 as the first frame
- The next frame that is at least 10 frames apart would be frame number 14
- The following frame that is at least 10 frames apart from 14 is 30
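The three steps above amount to a greedy scan, which can be sketched in plain Python (a minimal sketch, independent of pandas; filter_spaced is a hypothetical helper name):

```python
def filter_spaced(frames, min_gap=10):
    """Greedily keep frames at least `min_gap` apart, starting from the first."""
    if not frames:
        return []
    kept = [frames[0]]
    for f in frames[1:]:
        if f - kept[-1] >= min_gap:  # reading "at least 10 apart" as >= 10
            kept.append(f)
    return kept

print(filter_spaced([1, 2, 3, 4, 14, 20, 30, 31]))  # [1, 14, 30]
```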
As such, the dataframe before and after filtering should look like the following.
Before
seq_name label pedestrian_id frame_no global_id
0 0001 crossing 0001 0001 1
1 0001 crossing 0001 0002 1
2 0001 crossing 0001 0003 1
3 0001 crossing 0001 0004 1
4 0001 crossing 0001 0005 1
5 0001 crossing 0001 0006 1
6 0001 crossing 0001 0007 1
7 0001 crossing 0001 0008 1
8 0001 crossing 0001 0009 1
9 0001 crossing 0001 0010 1
10 0001 crossing 0002 0001 2
11 0001 crossing 0002 0012 2
12 0001 crossing 0002 0013 2
13 0001 crossing 0002 0014 2
14 0001 crossing 0002 0015 2
15 0001 crossing 0002 0029 2
16 0001 crossing 0002 0030 2
17 0001 crossing 0002 0031 2
18 0001 crossing 0002 0032 2
19 0001 crossing 0002 0033 2
20 0002 crossing 0001 0034 3
21 0002 crossing 0001 0035 3
22 0002 crossing 0001 0036 3
23 0002 crossing 0001 0037 3
24 0002 crossing 0001 0038 3
25 0002 crossing 0001 0039 3
26 0002 crossing 0001 0049 3
27 0002 crossing 0001 0050 3
28 0002 crossing 0001 0051 3
29 0002 crossing 0001 0052 3
After filtering
seq_name label pedestrian_id frame_no global_id
0 0001 crossing 0001 0001 1
10 0001 crossing 0002 0001 2
11 0001 crossing 0002 0012 2
15 0001 crossing 0002 0029 2
25 0002 crossing 0001 0039 3
26 0002 crossing 0001 0049 3
Below is what I have. Once I have the indices, I can create a new dataframe by indexing into the old one. I am still new to pandas and this looks extremely cumbersome, so I am hoping there is a more elegant solution. I have read through the docs on groupby and some other SO posts but still can't figure it out. This isn't homework; I'm just trying to clean up my data processing pipeline by replacing everything with pandas.
ind = []
for j in df["global_id"].unique():
    df_temp = df[df["global_id"] == j][["frame_no"]]
    df_temp["frame_no"] = pd.to_numeric(df_temp["frame_no"])
    start_frame = df_temp["frame_no"].min()
    end_frame = df_temp["frame_no"].max()
    i = start_frame - 1
    while i < end_frame:
        # np.min raises if no frame falls in the window (i, i+10)
        ind.append(np.min(df_temp[(df_temp["frame_no"] > i) & (df_temp["frame_no"] < i + 10)].index.tolist()))
        i += 10
asked Nov 13 '18 at 0:37 by Kong
1 Answer
Here is one way using groupby, but first you need to define a function that does, within each group, what you are looking for. To explain the idea, consider a simple dataframe: dfs = pd.DataFrame({'a': [1,2,3,4,14,20,30,31]}).
I have been trying to solve this kind of problem without looping for a while, and it seems complex; here is the idea I ended up with. In numpy, you can use subtract combined with outer to get all the pairwise differences between elements:
print(np.subtract.outer(dfs.a, dfs.a))
array([[ 0, -1, -2, -3, -13, -19, -29, -30],
[ 1, 0, -1, -2, -12, -18, -28, -29],
[ 2, 1, 0, -1, -11, -17, -27, -28],
[ 3, 2, 1, 0, -10, -16, -26, -27],
[ 13, 12, 11, 10, 0, -6, -16, -17],
[ 19, 18, 17, 16, 6, 0, -10, -11],
[ 29, 28, 27, 26, 16, 10, 0, -1],
[ 30, 29, 28, 27, 17, 11, 1, 0]], dtype=int64)
Now, for example, in column 0 you can see that differences > 10 start at row 4; going to column 4, differences > 10 start at row 6; and in column 6 no difference is big enough. So the filter keeps rows 0, 4 and 6, which are the values [1, 14, 30] we are looking for. To get these numbers, compare the np.subtract.outer result to 10 and sum over axis=0:
arr = (np.subtract.outer(dfs.a, dfs.a) <= 10).sum(0)
print(arr)
Now you see arr[0] = 4, then arr[4] = 6, then arr[6] = 8, which is out of bounds in this example, so we stop. One way to chase these numbers is a while loop (if someone has a pure numpy solution for this, I'm interested):
list_ind = [0]  # initialize the list of indices to keep with 0
arr = (np.subtract.outer(dfs.a, dfs.a) <= 10).sum(0)
i = arr[0]
while i < len(arr):
    list_ind.append(i)
    i = arr[i]
print(list_ind)
[0, 4, 6]
print(dfs.iloc[list_ind])
a
0 1
4 14
6 30
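As a cross-check, the same positions fall out of a plain greedy scan over the values (a minimal sketch; like the <= 10 mask above, it treats a gap of exactly 10 as too small):

```python
vals = [1, 2, 3, 4, 14, 20, 30, 31]
keep = [0]  # always keep the first frame
for i, v in enumerate(vals[1:], start=1):
    if v - vals[keep[-1]] > 10:  # same "> 10" rule as the <= 10 mask
        keep.append(i)
print(keep)  # [0, 4, 6]
```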
Now, for the whole problem with groupby, you can do:
# the column frame_no needs to be converted to integer
df['frame_int'] = pd.to_numeric(df['frame_no'])
# ensure the data is sorted by frame_int, whatever the global_id
df = df.sort_values('frame_int')

# define the function looking for the indices to keep
def find_ind(df_g):
    list_ind = [0]
    arr = (np.subtract.outer(df_g.frame_int, df_g.frame_int) <= 10).sum(0)
    i = arr[0]
    while i < len(arr):
        list_ind.append(i)
        i = arr[i]
    return df_g.iloc[list_ind]

# create the filtered dataframe
df_filtered = (df.groupby('global_id').apply(find_ind)
                 .drop('frame_int', axis=1).reset_index(drop=True))
print(df_filtered)
seq_name label pedestrian_id frame_no global_id
0 1 crossing 1 1 1
1 1 crossing 2 1 2
2 1 crossing 2 12 2
3 1 crossing 2 29 2
4 2 crossing 1 34 3
5 2 crossing 1 49 3
If you want to keep the original rows' index, add level=0 in reset_index, i.e. reset_index(level=0, drop=True), instead.
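To see this pipeline run end to end, here is a self-contained sketch on a small made-up dataframe (the column values are assumptions mirroring the question's layout; np.subtract.outer is applied to plain numpy arrays via .to_numpy(), since calling a ufunc's .outer directly on a Series is unsupported in newer pandas versions):

```python
import numpy as np
import pandas as pd

# Made-up data shaped like the question: two tracks with zero-padded frame numbers
df = pd.DataFrame({
    "global_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "frame_no": ["0001", "0002", "0014", "0020", "0005", "0016", "0017", "0030"],
})

df["frame_int"] = pd.to_numeric(df["frame_no"])  # frame_no is a string column
df = df.sort_values("frame_int")

def find_ind(df_g):
    # per column: count how many frames are at most 10 frames ahead of it
    vals = df_g["frame_int"].to_numpy()
    arr = (np.subtract.outer(vals, vals) <= 10).sum(0)
    list_ind = [0]
    i = arr[0]
    while i < len(arr):
        list_ind.append(i)
        i = arr[i]
    return df_g.iloc[list_ind]

out = df.groupby("global_id").apply(find_ind).reset_index(drop=True)
print(out[["global_id", "frame_no"]].to_string(index=False))
```

For track 1 (frames 1, 2, 14, 20) this keeps 1 and 14; for track 2 (frames 5, 16, 17, 30) it keeps 5, 16, and 30, since each kept frame is more than 10 frames after the previous one.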
answered Nov 13 '18 at 16:27 by Ben.T