Pandas - for loop and find index that has the nearest value

I am trying to loop through a grouped dataframe and find the index whose value is nearest to an input argument.



For example, given the dataframe below, for every group defined by global_id I want to keep only frames that are spaced at least 10 frames apart. For instance, if I have a list of frames [1,2,3,4,14,20,30,31], the output would be [1,14,30] because:




  • I would initialize by taking frame 1 as the first frame

  • The next frame that is at least 10 frames apart from 1 would be frame number 14

  • The following frame that is at least 10 frames apart from 14 is 30 (a minimal sketch of this rule follows the list)
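
A minimal plain-Python sketch of that greedy selection rule (my own illustration, not part of the original post; the function name thin_frames and the min_gap parameter are made-up names):

def thin_frames(frames, min_gap=10):
    """Keep a frame only if it is at least `min_gap` frames after the last kept frame."""
    kept = []
    for f in frames:
        if not kept or f - kept[-1] >= min_gap:
            kept.append(f)
    return kept

print(thin_frames([1, 2, 3, 4, 14, 20, 30, 31]))  # [1, 14, 30]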


As such, the before and after dataframes should look like the ones below.




Before




   seq_name     label pedestrian_id frame_no  global_id
0      0001  crossing          0001     0001          1
1      0001  crossing          0001     0002          1
2      0001  crossing          0001     0003          1
3      0001  crossing          0001     0004          1
4      0001  crossing          0001     0005          1
5      0001  crossing          0001     0006          1
6      0001  crossing          0001     0007          1
7      0001  crossing          0001     0008          1
8      0001  crossing          0001     0009          1
9      0001  crossing          0001     0010          1
10     0001  crossing          0002     0001          2
11     0001  crossing          0002     0012          2
12     0001  crossing          0002     0013          2
13     0001  crossing          0002     0014          2
14     0001  crossing          0002     0015          2
15     0001  crossing          0002     0029          2
16     0001  crossing          0002     0030          2
17     0001  crossing          0002     0031          2
18     0001  crossing          0002     0032          2
19     0001  crossing          0002     0033          2
20     0002  crossing          0001     0034          3
21     0002  crossing          0001     0035          3
22     0002  crossing          0001     0036          3
23     0002  crossing          0001     0037          3
24     0002  crossing          0001     0038          3
25     0002  crossing          0001     0039          3
26     0002  crossing          0001     0049          3
27     0002  crossing          0001     0050          3
28     0002  crossing          0001     0051          3
29     0002  crossing          0001     0052          3



After filter




   seq_name     label pedestrian_id frame_no  global_id
0      0001  crossing          0001     0001          1
10     0001  crossing          0002     0001          2
11     0001  crossing          0002     0012          2
15     0001  crossing          0002     0029          2
25     0002  crossing          0001     0039          3
26     0002  crossing          0001     0049          3


Below is what I have. Once I have the indices, I can create a new dataframe by indexing into the old one (a short sketch of that step follows the code). I am still new to Pandas and this looks extremely cumbersome, so I am hoping there is a more elegant solution. I have read through the docs on groupby and some other SO posts but still can't figure it out. This isn't homework; I am just trying to clean up my data processing pipeline by doing everything in Pandas.



ind = []
for j in df["global_id"].unique():
    df_temp = df[df["global_id"] == j][["frame_no"]]
    df_temp["frame_no"] = pd.to_numeric(df_temp["frame_no"])
    start_frame = df_temp["frame_no"].min()
    end_frame = df_temp["frame_no"].max()
    i = start_frame - 1
    while i < end_frame:
        # row labels of the frames falling in the current 10-frame window
        window = df_temp[(df_temp["frame_no"] > i) & (df_temp["frame_no"] < i + 10)].index.tolist()
        if window:  # guard against empty windows, which would make np.min fail
            ind.append(np.min(window))
        i += 10
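
As a hedged illustration of the indexing step mentioned above (my own addition, not in the original post; it only assumes that ind holds the row labels collected by the loop):

# build the filtered dataframe from the collected row labels
df_filtered = df.loc[sorted(set(ind))]
print(df_filtered)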









python pandas

asked Nov 13 '18 at 0:37 by Kong

1 Answer

Here is one way using groupby, but first you need to define a function that does what you are looking for within each group. To explain the idea, let's consider a simple dataframe dfs = pd.DataFrame({'a':[1,2,3,4,14,20,30,31]}).



I have been looking to solve this kind of problem for a while, trying to avoid looping, and it seems complex. Here is the idea I ended up with. In numpy, you can use subtract combined with outer to get all the pairwise differences between the elements:



print(np.subtract.outer(dfs.a, dfs.a))
array([[  0,  -1,  -2,  -3, -13, -19, -29, -30],
       [  1,   0,  -1,  -2, -12, -18, -28, -29],
       [  2,   1,   0,  -1, -11, -17, -27, -28],
       [  3,   2,   1,   0, -10, -16, -26, -27],
       [ 13,  12,  11,  10,   0,  -6, -16, -17],
       [ 19,  18,  17,  16,   6,   0, -10, -11],
       [ 29,  28,  27,  26,  16,  10,   0,  -1],
       [ 30,  29,  28,  27,  17,  11,   1,   0]], dtype=int64)


Now, for example, in column 0 you can see that differences > 10 start at row 4; then, going to column 4, differences > 10 start at row 6; and going to column 6 you don't get a difference big enough. So the filtering would keep rows 0, 4 and 6, which are the values [1,14,30] we are looking for. To get these numbers, you can compare the np.subtract.outer result to 10 and sum over axis=0, such as:



arr = (np.subtract.outer(dfs.a, dfs.a) <= 10).sum(0)
print(arr)
array([4, 4, 4, 5, 6, 7, 8, 8])


Now you see arr[0] = 4, then arr[4] = 6, then arr[6] = 8, which is out of bounds in this example, so we stop. One way to collect these numbers is with a while loop (if someone has a pure-numpy way to do this step, I'm interested):



list_ind = [0]  # initialize the list of indices to keep with 0
arr = (np.subtract.outer(dfs.a, dfs.a) <= 10).sum(0)
i = arr[0]
while i < len(arr):
    list_ind.append(i)
    i = arr[i]

print(list_ind)
[0, 4, 6]

print(dfs.iloc[list_ind])
    a
0   1
4  14
6  30


Now, with the whole problem and groupby, you can do:



# it seems you need to convert the column frame_no to integer
df['frame_int'] = pd.to_numeric(df['frame_no'])
# ensure the data is sorted by frame_int, whatever the global_id
df = df.sort_values('frame_int')

# define the function looking for the indices to keep
def find_ind(df_g):
    list_ind = [0]
    arr = (np.subtract.outer(df_g.frame_int, df_g.frame_int) <= 10).sum(0)
    i = arr[0]
    while i < len(arr):
        list_ind.append(i)
        i = arr[i]
    return df_g.iloc[list_ind]

# create the filtered dataframe
df_filtered = (df.groupby('global_id').apply(find_ind)
                 .drop('frame_int', axis=1).reset_index(drop=True))

print(df_filtered)
  seq_name     label pedestrian_id frame_no global_id
0        1  crossing             1        1         1
1        1  crossing             2        1         2
2        1  crossing             2       12         2
3        1  crossing             2       29         2
4        2  crossing             1       34         3
5        2  crossing             1       49         3


If you want to keep the original row indices, you can add level=0 in reset_index, i.e. use reset_index(level=0, drop=True) instead.
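
As a hedged illustration of that variant (not part of the original answer; it assumes df and find_ind are defined as above):

# same pipeline, but keep the original row labels instead of renumbering
df_filtered = (df.groupby('global_id').apply(find_ind)
                 .drop('frame_int', axis=1)
                 .reset_index(level=0, drop=True))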






answered Nov 13 '18 at 16:27 by Ben.T