Fastest way to make python Object out of numpy array rows























I need to make a list of objects out of a numpy array (or a pandas dataframe). Each row holds all the attribute values for the object (see example).



import numpy as np

class Dog:

    def __init__(self, weight, height, width, girth):
        self.weight = weight
        self.height = height
        self.width = width
        self.girth = girth


dogs = np.array([[5, 100, 50, 80], [4, 80, 30, 70], [7, 120, 60, 90], [2, 50, 30, 50]])

# list comprehension with indexes
dog_list = [Dog(dogs[i][0], dogs[i][1], dogs[i][2], dogs[i][3]) for i in range(len(dogs))]


My real data is of course much bigger (up to a million rows with 5 columns), so iterating line by line and looking up the correct index takes ages. Is there a way to vectorize this or generally make it more efficient/faster? I tried finding ways myself, but I couldn't find anything translatable, at least at my level of expertise.



It's extremely important that the order of rows is preserved though, so if that doesn't work out, I suppose I'll have to live with the slow operation.



Cheers!



EDIT - regarding question about np.vectorize:



This is part of my actual code along with some actual data:



import numpy as np


class Particle:
    TrackID = 0  # class-level counter shared by all instances

    def __init__(self, uniq_ident, intensity, sigma, chi2, past_nn_ident, past_distance, aligned_x, aligned_y, NeNA):
        self.uniq_ident = uniq_ident
        self.intensity = intensity
        self.sigma = sigma
        self.chi2 = chi2
        self.past_nn_ident = past_nn_ident
        self.past_distance = past_distance
        self.aligned_y = aligned_y
        self.aligned_x = aligned_x
        self.NeNA = NeNA
        self.new_track_length = 1
        self.quality_pass = True
        self.re_seeder(self.NeNA)

    def re_seeder(self, NeNA):
        if np.isnan(self.past_nn_ident):
            self.newseed = True
            self.new_track_id = Particle.TrackID
            print(self.new_track_id)
            Particle.TrackID += 1
        else:
            self.newseed = False
            self.new_track_id = None

# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
data = np.array([[0.00000000e+00, 2.98863746e+03, 2.11794100e+02, 1.02241467e+04, np.nan, np.nan, 9.00081968e+02, 2.52456745e+04, 1.50000000e+01],
                 [1.00000000e+00, 2.80583577e+03, 4.66145720e+02, 6.05642671e+03, np.nan, np.nan, 8.27249728e+02, 2.26365501e+04, 1.50000000e+01],
                 [2.00000000e+00, 5.28702810e+02, 3.30889610e+02, 5.10632793e+03, np.nan, np.nan, 6.03337243e+03, 6.52702811e+04, 1.50000000e+01],
                 [3.00000000e+00, 3.56128350e+02, 1.38663730e+02, 3.37923885e+03, np.nan, np.nan, 6.43263261e+03, 6.14788766e+04, 1.50000000e+01],
                 [4.00000000e+00, 9.10148200e+01, 8.30057400e+01, 4.31205993e+03, np.nan, np.nan, 7.63955009e+03, 6.08925862e+04, 1.50000000e+01]])

Particle.TrackID = 0
particles = np.vectorize(Particle)(*data.transpose())

l = [p.new_track_id for p in particles]


The curious thing is that the print statement inside the re_seeder function, print(self.new_track_id), prints 0, 1, 2, 3, 4, 5.

If I then take the particle objects and make a list out of their new_track_id attributes, l = [p.new_track_id for p in particles], the values are 1, 2, 3, 4, 5.

So somewhere, somehow, the first object is either lost, overwritten, or something else happens that I don't understand.


































  • Not sure if this is any faster but it is simpler: dog_list = [Dog(*row) for row in dogs]
    – Tomothy32
    Nov 8 at 9:00












  • Better [Dog(*x) for x in dogs.tolist()]
    – Paul Panzer
    Nov 8 at 9:02










  • Thanks, these should at least keep my code cleaner!
    – David
    Nov 8 at 9:16






  • Vectorizing the class constructor gives you another boost: dog_list = np.vectorize(Dog)(*dogs.transpose())
    – Jeronimo
    Nov 8 at 9:17










  • @Jeronimo Holy crap, that just sped up my code from 50s to 1.3s :D Thanks a ton!
    – David
    Nov 8 at 9:45















python-3.x pandas numpy oop vectorization














asked Nov 8 at 8:47 by David (edited Nov 8 at 16:23)
























3 Answers













You won't get great efficiency/speed gains as long as you are insisting on building Python objects. With that many items, you will be much better served by keeping the data in the numpy array. If you want nicer attribute access, you could cast the array as a record array (recarray), which would allow you to name the columns (as weight, height, etc) while still having the data in the numpy array.



import numpy as np

dog_t = np.dtype([
    ('weight', int),
    ('height', int),
    ('width', int),
    ('girth', int)
])

dogs = np.array([
    (5, 100, 50, 80),
    (4, 80, 30, 70),
    (7, 120, 60, 90),
    (2, 50, 30, 50),
], dtype=dog_t)

dogs_recarray = dogs.view(np.recarray)

print(dogs_recarray.weight)     # the whole weight column
print(dogs_recarray[2].height)  # one field of one row


You can also mix and match data types if you need to (if some columns are integer and others are float, for example). Be aware when playing with this code that the items in the dogs array need to be specified as tuples (using ()) rather than as lists for the datatype to be applied properly.
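If the data already exists as a plain 2-D array, as in the question, it does not have to be retyped as tuples. The following is a minimal sketch, assuming np.rec.fromarrays is an acceptable way to build the record array from the columns; the mixed dtype (float weight, integer for the rest) is only an illustration and not part of the original answer.

```python
import numpy as np

# Plain 2-D array as in the question: one row per dog.
dogs = np.array([[5, 100, 50, 80], [4, 80, 30, 70], [7, 120, 60, 90], [2, 50, 30, 50]])

# Illustrative mixed dtype: weight as float, the other fields as int.
dog_t = np.dtype([
    ('weight', float),
    ('height', int),
    ('width', int),
    ('girth', int),
])

# fromarrays expects one array per field, so pass the columns (dogs.T).
dogs_rec = np.rec.fromarrays(dogs.T, dtype=dog_t)

print(dogs_rec.weight)      # column access by name
print(dogs_rec[2].height)   # field of a single row
```

Row order is unchanged, because each field is filled from the corresponding column of the original array.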





























  • Thanks! Unfortunately I have to use objects in this case (it would take some major redesigning and the time investment wouldn't be worth it) so I guess I'll have to live with it being a bit slow. At least I now know I don't need to keep searching!
    – David
    Nov 8 at 9:16










  • You might still find a few tweaks that can help. @jeronimo has left a comment about using np.vectorize which might be useful. With that many objects, using __slots__ on your Dog class might help a bit too
    – lxop
    Nov 8 at 9:25
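The __slots__ suggestion from the comment above could look roughly like this. It is a sketch using the Dog class from the question; how much it helps with a million objects is not measured here.

```python
class Dog:
    # Declaring __slots__ removes the per-instance __dict__, which
    # reduces memory per object and can make creation slightly cheaper.
    __slots__ = ("weight", "height", "width", "girth")

    def __init__(self, weight, height, width, girth):
        self.weight = weight
        self.height = height
        self.width = width
        self.girth = girth


dog = Dog(5, 100, 50, 80)
has_dict = hasattr(dog, "__dict__")  # False for a slotted class
print(dog.weight, has_dict)
```

The trade-off is that attributes not listed in __slots__ can no longer be added to an instance later.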

































Multiprocessing might be worth a look.



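from multiprocessing import Pool

Function to build one object from a row (defined at module level so that it can be pickled and sent to the workers):

def make_dog(row):
    return Dog(*row)

Let multiple workers build the objects in parallel. Appending to a shared dog_list from inside the workers would not work, because each worker process only modifies its own copy of the list; Pool.map instead returns the results to the parent process in input order:

number_of_workers = 4

if __name__ == "__main__":
    with Pool(processes=number_of_workers) as pool:
        dog_list = pool.map(make_dog, dogs.tolist())

A lambda cannot be used as a shorter version here, because the function passed to Pool.map must be picklable. Note that every Dog is pickled on its way back to the parent process, so for objects this cheap to build the overhead may outweigh the gain; measure before relying on it.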




























  • Thanks! I ended up using the np.vectorize function Jeronimo proposed, but this will fit very nicely for another thing I'm working on.
    – David
    Nov 14 at 13:14































With a simple class:



class Foo():
    _id = 0
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
        self.id = self._id
        Foo._id += 1
    def __repr__(self):
        return '<Foo %s>'%self.id


In [23]: arr = np.arange(12).reshape(4,3)


A straightforward list comprehension:



In [24]: [Foo(*xyz) for xyz in arr]
Out[24]: [<Foo 0>, <Foo 1>, <Foo 2>, <Foo 3>]


Default use of vectorize:



In [26]: np.vectorize(Foo)(*arr.T)
Out[26]: array([<Foo 5>, <Foo 6>, <Foo 7>, <Foo 8>], dtype=object)


Note that Foo 4 was skipped. vectorize performs a trial calculation to determine the return dtype (here object). (This has caused problems for other users.) We can get around that by specifying otypes. There's also a cache parameter that might work, but I haven't played with that.



In [27]: np.vectorize(Foo,otypes=[object])(*arr.T)
Out[27]: array([<Foo 9>, <Foo 10>, <Foo 11>, <Foo 12>], dtype=object)
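The extra trial call can be made visible by counting how often the wrapped function runs. This is a small sketch; the record function and the calls list are hypothetical helpers introduced only for this demonstration.

```python
import numpy as np

calls = []  # hypothetical log of every call made to the wrapped function

def record(x, y, z):
    # Stand-in for a constructor with a side effect (like a class counter).
    calls.append((x, y, z))
    return object()

arr = np.arange(12).reshape(4, 3)

# Without otypes, vectorize calls the function once on the first row
# just to find out the output type, then once per row.
np.vectorize(record)(*arr.T)
calls_without_otypes = len(calls)

calls.clear()

# With otypes given, no trial call is needed.
np.vectorize(record, otypes=[object])(*arr.T)
calls_with_otypes = len(calls)

print(calls_without_otypes, calls_with_otypes)
```

For four rows the first count is one higher than the second, which matches the skipped id in the output above and the "lost" first track id in the question.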


Internally vectorize uses frompyfunc, which in this case works just as well, and in my experience is faster:



In [28]: np.frompyfunc(Foo, 3,1)(*arr.T)
Out[28]: array([<Foo 13>, <Foo 14>, <Foo 15>, <Foo 16>], dtype=object)


Normally vectorize/frompyfunc pass 'scalar' values to the function, iterating over all elements of a 2d array. But passing *arr.T is a clever way of handing over rows: each column becomes one argument array, so the function effectively sees a 1d sequence of row tuples.



In [31]: list(zip(*arr.T)) 
Out[31]: [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11)]
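Applied to the Dog class from the question, the frompyfunc approach might look like the sketch below. It assumes a plain Python list is wanted at the end, hence the final tolist() call.

```python
import numpy as np

class Dog:
    def __init__(self, weight, height, width, girth):
        self.weight = weight
        self.height = height
        self.width = width
        self.girth = girth

dogs = np.array([[5, 100, 50, 80], [4, 80, 30, 70], [7, 120, 60, 90], [2, 50, 30, 50]])

# 4 inputs (one per column), 1 output (the constructed object).
make_dogs = np.frompyfunc(Dog, 4, 1)

dog_array = make_dogs(*dogs.T)   # object array, one Dog per row
dog_list = dog_array.tolist()    # plain list of the same objects

weights = [d.weight for d in dog_list]
print(weights)
```

Each result sits at the index of its input row, so the row order the question cares about is preserved.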




Some comparative times:



In [32]: Foo._id=0
In [33]: timeit [Foo(*xyz) for xyz in arr]
14.2 µs ± 17.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [34]: Foo._id=0
In [35]: timeit np.vectorize(Foo,otypes=[object])(*arr.T)
44.9 µs ± 108 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [36]: Foo._id=0
In [37]: timeit np.frompyfunc(Foo, 3,1)(*arr.T)
15.6 µs ± 18.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


This is consistent with my past timings. vectorize is slow. frompyfunc is competitive with a list comprehension, sometimes even 2x faster. Wrapping the list comprehension in an array will slow it down, e.g. np.array([Foo(*xyz)...]).



And your original list comprehension:



In [40]: timeit [Foo(arr[i][0],arr[i][1],arr[i][2]) for i in range(len(arr))]
10.1 µs ± 80 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


That's even faster! So if your goal is a list rather than an array, I don't see the point of using numpy tools.



Of course these timings on a small example need to be viewed with caution.
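To repeat the comparison on a larger input, something like the following harness could be used. It is only a sketch: the array size of 20,000 rows, the simplified Foo class without the id counter, and the candidates dictionary are choices made for this example, and no particular ranking is claimed.

```python
import timeit
import numpy as np

class Foo:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

# Size chosen arbitrarily for the sketch; increase it to match real data.
arr = np.random.randint(1, 100, (20_000, 3))

candidates = {
    "index comprehension": lambda: [Foo(arr[i][0], arr[i][1], arr[i][2]) for i in range(len(arr))],
    "row comprehension": lambda: [Foo(*row) for row in arr],
    "tolist comprehension": lambda: [Foo(*row) for row in arr.tolist()],
    "vectorize": lambda: np.vectorize(Foo, otypes=[object])(*arr.T),
    "frompyfunc": lambda: np.frompyfunc(Foo, 3, 1)(*arr.T),
}

# Best of three single runs per candidate, in seconds.
timings = {name: min(timeit.repeat(fn, number=1, repeat=3))
           for name, fn in candidates.items()}

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:>22}: {seconds * 1e3:8.1f} ms")
```

The numbers depend on the machine, the NumPy version and the cost of the constructor, so they should be measured on the real class and data.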





























  • Interesting. Can you do these timings again with a big array, say np.random.randint(1, 100, (1000000, 4))?
    – Jeronimo
    Nov 8 at 18:14










  • @Jeronimo, hpaulj I think besides constant overheads there is one significant cost, which is numpy's slow __getitem__. vectorize, frompyfunc and .tolist all avoid this and consequently scale similarly to each other and better than the other approaches. For small arrays .tolist seems fastest, for large arrays frompyfunc
    – Paul Panzer
    Nov 8 at 19:38












  • Interesting, that solves the mystery, thanks a ton! That one missing counter was driving me crazy! Vectorize and frompyfunc are absolutely much faster than my list comprehension though, at least on larger datasets. A full dataset takes roughly 8.7 seconds with vectorize and 9.1 with frompyfunc. The same file needs 33 seconds with the list comprehension.
    – David
    Nov 9 at 8:37
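The .tolist() variant mentioned in the comments can be sketched as follows, using the Dog class from the question. Besides avoiding repeated indexing into the array, it changes what the attributes hold: iterating over the array directly hands NumPy scalars to the constructor, while tolist() converts everything to plain Python numbers once, up front.

```python
import numpy as np

class Dog:
    def __init__(self, weight, height, width, girth):
        self.weight = weight
        self.height = height
        self.width = width
        self.girth = girth

dogs = np.array([[5, 100, 50, 80], [4, 80, 30, 70], [7, 120, 60, 90], [2, 50, 30, 50]])

# Iterating over the array: each row is an ndarray, each value a NumPy scalar.
from_rows = [Dog(*row) for row in dogs]

# Converting once to nested Python lists: each value is a plain int.
from_lists = [Dog(*row) for row in dogs.tolist()]

print(type(from_rows[0].weight), type(from_lists[0].weight))
print([d.weight for d in from_lists])
```

Both comprehensions walk the rows from first to last, so the order of the resulting list matches the order of the array.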











Answer 1 (record arrays): answered Nov 8 at 9:05 by lxop, edited Nov 8 at 9:11












  • Thanks! Unfortunately I have to use objects in this case (it would take some major redesigning and the time investment wouldn't be worth it) so I guess I'll have to live with it being a bit slow. At least I now know I don't need to keep searching!
    – David
    Nov 8 at 9:16










  • You might still find a few tweaks that can help. @jeronimo has left a comment about using np.vectorize which might be useful. With that many objects, using slots on your Dog class might help a bit too
    – lxop
    Nov 8 at 9:25




















  • Thanks! Unfortunately I have to use objects in this case (it would take some major redesigning and the time investment wouldn't be worth it) so I guess I'll have to live with it being a bit slow. At least I now know I don't need to keep searching!
    – David
    Nov 8 at 9:16










  • You might still find a few tweaks that can help. @jeronimo has left a comment about using np.vectorize which might be useful. With that many objects, using slots on your Dog class might help a bit too
    – lxop
    Nov 8 at 9:25


















Thanks! Unfortunately I have to use objects in this case (it would take some major redesigning and the time investment wouldn't be worth it) so I guess I'll have to live with it being a bit slow. At least I now know I don't need to keep searching!
– David
Nov 8 at 9:16




Thanks! Unfortunately I have to use objects in this case (it would take some major redesigning and the time investment wouldn't be worth it) so I guess I'll have to live with it being a bit slow. At least I now know I don't need to keep searching!
– David
Nov 8 at 9:16












You might still find a few tweaks that can help. @jeronimo has left a comment about using np.vectorize which might be useful. With that many objects, using slots on your Dog class might help a bit too
– lxop
Nov 8 at 9:25






You might still find a few tweaks that can help. @jeronimo has left a comment about using np.vectorize which might be useful. With that many objects, using slots on your Dog class might help a bit too
– lxop
Nov 8 at 9:25














up vote
0
down vote













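Multiprocessing might be worth a look. Note that worker processes do not share memory with the parent, so appending to a global list inside a worker only changes that worker's copy. Each worker should return its object instead, and the pool collects the results.

from multiprocessing import Pool

Function to build one object from a row (it has to be defined at module level so that it can be pickled):

def make_dog(row):
    return Dog(*row)

Let multiple workers build the objects in parallel. pool.map returns the results in input order, so the order of rows is preserved:

number_of_workers = 4
# On platforms that spawn workers (e.g. Windows), run this under
# an `if __name__ == "__main__":` guard.
with Pool(processes=number_of_workers) as pool:
    dog_list = pool.map(make_dog, dogs)

A lambda cannot be used as a shorter version here, because the pool has to pickle the function it sends to the workers. Every Dog also has to be pickled on the way back to the parent, so it is worth timing this against the single-process version.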




























  • Thanks! I ended up using the np.vectorize function Jeronimo proposed, but this will fit very nicely for another thing I'm working on.
    – David
    Nov 14 at 13:14























edited Nov 8 at 9:56

























answered Nov 8 at 9:50









randomwalker




































With a simple class:



class Foo():
    _id = 0
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
        self.id = self._id
        Foo._id += 1
    def __repr__(self):
        return '<Foo %s>'%self.id


In [23]: arr = np.arange(12).reshape(4,3)


A straightforward list comprehension:



In [24]: [Foo(*xyz) for xyz in arr]
Out[24]: [<Foo 0>, <Foo 1>, <Foo 2>, <Foo 3>]


Default use of vectorize:



In [26]: np.vectorize(Foo)(*arr.T)
Out[26]: array([<Foo 5>, <Foo 6>, <Foo 7>, <Foo 8>], dtype=object)


Note that Foo 4 was skipped. When otypes is not given, vectorize makes a trial call on the first element to determine the return dtype (here object), and that call creates and discards one extra Foo. (This has caused problems for other users.) We can get around that by specifying otypes. There's also a cache parameter that might work, but I haven't played with that.



In [27]: np.vectorize(Foo,otypes=[object])(*arr.T)
Out[27]: array([<Foo 9>, <Foo 10>, <Foo 11>, <Foo 12>], dtype=object)
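The skipped id can be made visible by counting constructor calls. This is a minimal sketch, assuming a hypothetical `Counted` class that stands in for Foo and does nothing but count how often it is instantiated:

```python
import numpy as np


class Counted:
    """Hypothetical stand-in for Foo that counts constructor calls."""
    calls = 0

    def __init__(self, x, y, z):
        Counted.calls += 1
        self.xyz = (x, y, z)


arr = np.arange(12).reshape(4, 3)

# Without otypes, vectorize calls the function once on the first element
# to find the output type, then once per row.
Counted.calls = 0
np.vectorize(Counted)(*arr.T)
calls_default = Counted.calls

# With otypes given, no trial call is needed.
Counted.calls = 0
np.vectorize(Counted, otypes=[object])(*arr.T)
calls_with_otypes = Counted.calls

print(calls_default, calls_with_otypes)
```

For a constructor with side effects, such as the class-level TrackID counter in the question, that extra call is exactly what shifts the numbering.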


Internally vectorize uses frompyfunc, which in this case works just as well, and in my experience is faster:



In [28]: np.frompyfunc(Foo, 3,1)(*arr.T)
Out[28]: array([<Foo 13>, <Foo 14>, <Foo 15>, <Foo 16>], dtype=object)


Normally vectorize/frompyfunc pass 'scalar' values to the function, iterating over all elements of a 2d array. But the use of *arr.T is a clever way of passing rows: each column becomes one argument array, so the function is called once per row - effectively a 1d array of tuples.



In [31]: list(zip(*arr.T)) 
Out[31]: [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11)]
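To check that the function really receives one row per call, and in the original order, the calls can be recorded. The `record` function and the `rows_seen` list below are hypothetical helpers used only for this demonstration:

```python
import numpy as np

arr = np.arange(12).reshape(4, 3)
rows_seen = []


def record(x, y, z):
    # Store the arguments of every call so the call order can be inspected.
    rows_seen.append((int(x), int(y), int(z)))


# arr.T has one row per original column; unpacking it passes three 1d arrays,
# and frompyfunc then calls record once per position across those arrays.
np.frompyfunc(record, 3, 1)(*arr.T)

print(rows_seen)
```

Since the three argument arrays are walked in step, row order is preserved, which was a hard requirement in the question.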




Some comparative times:



In [32]: Foo._id=0
In [33]: timeit [Foo(*xyz) for xyz in arr]
14.2 µs ± 17.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [34]: Foo._id=0
In [35]: timeit np.vectorize(Foo,otypes=[object])(*arr.T)
44.9 µs ± 108 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [36]: Foo._id=0
In [37]: timeit np.frompyfunc(Foo, 3,1)(*arr.T)
15.6 µs ± 18.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


This is consistent with my past timings. vectorize is slow. frompyfunc is competitive with a list comprehension, sometimes even 2x faster. Wrapping the list comprehension in an array will slow it down, e.g. np.array([Foo(*xyz)...]).



And your original list comprehension:



In [40]: timeit [Foo(arr[i][0],arr[i][1],arr[i][2]) for i in range(len(arr))]
10.1 µs ± 80 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


That's even faster! So if your goal is a list rather than an array, I don't see the point of using numpy tools.



Of course these timings on a small example need to be viewed with caution.
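To repeat the comparison on something closer to the real data size, a small harness along these lines could be used. It is a sketch under assumptions: the function names, the default row count and the use of `timeit.repeat` are my choices, and Foo is reduced to plain attribute assignment. No timings are quoted here because they depend on the machine and the numpy version.

```python
import timeit

import numpy as np


class Foo:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z


def by_rows(arr):
    # Iterate over rows; each row is a small numpy array.
    return [Foo(*xyz) for xyz in arr]


def by_index(arr):
    # The indexed form from the question.
    return [Foo(arr[i][0], arr[i][1], arr[i][2]) for i in range(len(arr))]


def by_vectorize(arr):
    return np.vectorize(Foo, otypes=[object])(*arr.T).tolist()


def by_frompyfunc(arr):
    return np.frompyfunc(Foo, 3, 1)(*arr.T).tolist()


def benchmark(n_rows=100_000, repeats=3):
    """Return the best wall-clock time in seconds for each approach."""
    arr = np.random.randint(1, 100, (n_rows, 3))
    builders = (by_rows, by_index, by_vectorize, by_frompyfunc)
    return {
        build.__name__: min(timeit.repeat(lambda: build(arr), number=1, repeat=repeats))
        for build in builders
    }


if __name__ == "__main__":
    for name, seconds in benchmark().items():
        print(f"{name:>14}: {seconds:.3f} s")
```

All four functions return a plain list in row order, so the comparison covers the same end result as the list comprehension.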





























  • Interesting. Can you do these timings again with a big array, say np.random.randint(1, 100, (1000000, 4))?
    – Jeronimo
    Nov 8 at 18:14










  • @Jeronimo, hpaulj I think besides constant overheads there is one significant cost, which is numpy's slow __getitem__. vectorize, frompyfunc and .tolist all avoid this and consequently scale similarly to each other and better than the other approaches. For small arrays .tolist seems fastest; for large arrays, frompyfunc.
    – Paul Panzer
    Nov 8 at 19:38
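The .tolist route mentioned in the comment above can be sketched as follows, using the Dog class and sample data from the question. One thing to be aware of is that the attributes then hold plain Python numbers rather than numpy scalars.

```python
import numpy as np


class Dog:
    def __init__(self, weight, height, width, girth):
        self.weight = weight
        self.height = height
        self.width = width
        self.girth = girth


dogs = np.array([[5, 100, 50, 80], [4, 80, 30, 70], [7, 120, 60, 90], [2, 50, 30, 50]])

# tolist() converts the whole array to nested Python lists in one call,
# so the loop below never indexes into the numpy array.
dog_list = [Dog(*row) for row in dogs.tolist()]

weights = [dog.weight for dog in dog_list]
print(weights)
```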












  • Interesting, that solves the mystery, thanks a ton! That one missing counter was driving me crazy! Vectorize and frompyfunc are absolutely much faster than my list comprehension though, at least on larger datasets. A full dataset takes roughly 8.7 seconds with vectorize and 9.1 with frompyfunc. The same file needs 33 seconds with the list comprehension.
    – David
    Nov 9 at 8:37























edited Nov 8 at 17:52

























answered Nov 8 at 17:46









hpaulj






























