Filtering data stored in csv files





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}







-2















I have a csv file that contains my portfolio information (i.e. records of stocks purchased in last 20 years). File has around millions of records. I want to write a python function that reads this files and returns filtered data.



Fields in csv file:



Stock Quantity Price PurchaseDate 


Querying ability examples:




  1. Ability to query purchases made in a given duration.

  2. Ability to query purchases made for a specific stock.

  3. Ability to query information like stocks purchased in quantity greater then 50 units, stocks purchased for less then 50$ etc.


Questions:




  1. How will the signature of search_portfolio method look like? Specifically, how will users provide all these filtering information?

  2. Which efficient data structure should be used to read and filter this information from file? Should I read this Pandas DF or use something like Python dicts? Why one and not the other?

  3. I don't want to read the file content again if data in the file has not changed, how can I memoize this data?


Generic API signature and implementation ideas are really appreciated.



Sample Data:



MSFT,50,100,23/02/2018
APPL,20,901,03/02/2010
AMZN,50,210,02/11/2005
MSFT,21,103,25/02/2015
MSFT,54,101,14/06/2014
AMZN,80,230,09/08/2009
MSFT,35,100,13/09/2007
GOOG,24,830,08/02/2006
MSFT,45,102,18/07/2007
GOOG,30,701,03/03/2006









share|improve this question




















  • 2





    post a minimal and testable input data

    – RomanPerekhrest
    Nov 25 '18 at 7:01











  • added sample data

    – Lokesh Agrawal
    Nov 25 '18 at 7:25











  • The reason people will downvote it because you didn't post what have you tried so far. make a habit of posting the tried code snippet's so that people can help you better.

    – Shiv Rajawat
    Nov 25 '18 at 7:51


















-2















I have a csv file that contains my portfolio information (i.e. records of stocks purchased in last 20 years). File has around millions of records. I want to write a python function that reads this files and returns filtered data.



Fields in csv file:



Stock Quantity Price PurchaseDate 


Querying ability examples:




  1. Ability to query purchases made in a given duration.

  2. Ability to query purchases made for a specific stock.

  3. Ability to query information like stocks purchased in quantity greater then 50 units, stocks purchased for less then 50$ etc.


Questions:




  1. How will the signature of search_portfolio method look like? Specifically, how will users provide all these filtering information?

  2. Which efficient data structure should be used to read and filter this information from file? Should I read this Pandas DF or use something like Python dicts? Why one and not the other?

  3. I don't want to read the file content again if data in the file has not changed, how can I memoize this data?


Generic API signature and implementation ideas are really appreciated.



Sample Data:



MSFT,50,100,23/02/2018
APPL,20,901,03/02/2010
AMZN,50,210,02/11/2005
MSFT,21,103,25/02/2015
MSFT,54,101,14/06/2014
AMZN,80,230,09/08/2009
MSFT,35,100,13/09/2007
GOOG,24,830,08/02/2006
MSFT,45,102,18/07/2007
GOOG,30,701,03/03/2006









share|improve this question




















  • 2





    post a minimal and testable input data

    – RomanPerekhrest
    Nov 25 '18 at 7:01











  • added sample data

    – Lokesh Agrawal
    Nov 25 '18 at 7:25











  • The reason people will downvote it because you didn't post what have you tried so far. make a habit of posting the tried code snippet's so that people can help you better.

    – Shiv Rajawat
    Nov 25 '18 at 7:51














-2












-2








-2








I have a csv file that contains my portfolio information (i.e. records of stocks purchased in last 20 years). File has around millions of records. I want to write a python function that reads this files and returns filtered data.



Fields in csv file:



Stock Quantity Price PurchaseDate 


Querying ability examples:




  1. Ability to query purchases made in a given duration.

  2. Ability to query purchases made for a specific stock.

  3. Ability to query information like stocks purchased in quantity greater then 50 units, stocks purchased for less then 50$ etc.


Questions:




  1. How will the signature of search_portfolio method look like? Specifically, how will users provide all these filtering information?

  2. Which efficient data structure should be used to read and filter this information from file? Should I read this Pandas DF or use something like Python dicts? Why one and not the other?

  3. I don't want to read the file content again if data in the file has not changed, how can I memoize this data?


Generic API signature and implementation ideas are really appreciated.



Sample Data:



MSFT,50,100,23/02/2018
APPL,20,901,03/02/2010
AMZN,50,210,02/11/2005
MSFT,21,103,25/02/2015
MSFT,54,101,14/06/2014
AMZN,80,230,09/08/2009
MSFT,35,100,13/09/2007
GOOG,24,830,08/02/2006
MSFT,45,102,18/07/2007
GOOG,30,701,03/03/2006









share|improve this question
















I have a csv file that contains my portfolio information (i.e. records of stocks purchased in last 20 years). File has around millions of records. I want to write a python function that reads this files and returns filtered data.



Fields in csv file:



Stock Quantity Price PurchaseDate 


Querying ability examples:




  1. Ability to query purchases made in a given duration.

  2. Ability to query purchases made for a specific stock.

  3. Ability to query information like stocks purchased in quantity greater then 50 units, stocks purchased for less then 50$ etc.


Questions:




  1. How will the signature of search_portfolio method look like? Specifically, how will users provide all these filtering information?

  2. Which efficient data structure should be used to read and filter this information from file? Should I read this Pandas DF or use something like Python dicts? Why one and not the other?

  3. I don't want to read the file content again if data in the file has not changed, how can I memoize this data?


Generic API signature and implementation ideas are really appreciated.



Sample Data:



MSFT,50,100,23/02/2018
APPL,20,901,03/02/2010
AMZN,50,210,02/11/2005
MSFT,21,103,25/02/2015
MSFT,54,101,14/06/2014
AMZN,80,230,09/08/2009
MSFT,35,100,13/09/2007
GOOG,24,830,08/02/2006
MSFT,45,102,18/07/2007
GOOG,30,701,03/03/2006






python






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 26 '18 at 17:32









E_net4

13.1k73873




13.1k73873










asked Nov 25 '18 at 6:49









Lokesh AgrawalLokesh Agrawal

5971033




5971033








  • 2





    post a minimal and testable input data

    – RomanPerekhrest
    Nov 25 '18 at 7:01











  • added sample data

    – Lokesh Agrawal
    Nov 25 '18 at 7:25











  • The reason people will downvote it because you didn't post what have you tried so far. make a habit of posting the tried code snippet's so that people can help you better.

    – Shiv Rajawat
    Nov 25 '18 at 7:51














  • 2





    post a minimal and testable input data

    – RomanPerekhrest
    Nov 25 '18 at 7:01











  • added sample data

    – Lokesh Agrawal
    Nov 25 '18 at 7:25











  • The reason people will downvote it because you didn't post what have you tried so far. make a habit of posting the tried code snippet's so that people can help you better.

    – Shiv Rajawat
    Nov 25 '18 at 7:51








2




2





post a minimal and testable input data

– RomanPerekhrest
Nov 25 '18 at 7:01





post a minimal and testable input data

– RomanPerekhrest
Nov 25 '18 at 7:01













added sample data

– Lokesh Agrawal
Nov 25 '18 at 7:25





added sample data

– Lokesh Agrawal
Nov 25 '18 at 7:25













The reason people will downvote it because you didn't post what have you tried so far. make a habit of posting the tried code snippet's so that people can help you better.

– Shiv Rajawat
Nov 25 '18 at 7:51





The reason people will downvote it because you didn't post what have you tried so far. make a habit of posting the tried code snippet's so that people can help you better.

– Shiv Rajawat
Nov 25 '18 at 7:51












1 Answer
1






active

oldest

votes


















2














First, read the data from csv to a pandas dataframe.



import pandas as pd
# parse dates from column index 3: Purchasedate
df=pd.read_csv('file_name.csv', parse_dates=True, index_col=3)


Now if you want to select stocks purchased between two dates



print(df.loc['2010-07-01' : '2010-10-02'])


Similarly if you want data before/after a particular date, use



print(df.loc['2010-07-01' : ])   ## All records after given date
print(df.loc[ : '2010-10-02']) ## All records before given date


If you want to filter records based on some condition, simply iterate through entire dataframe and use conditional statement to filter them.



for index, row in df.iterrows():
if(row['Stock'] == 'MSFT'):
print(row)


Or



for index, row in df.iterrows():
if(row['Price'] > 100):
print(row)


There are many other ways you can achieve this. Just play around it and you'll get to know.






share|improve this answer
























  • Thanks for the answer. Is Pandas DF an optimal data structure for solving such problems in Python? What about search_portfolio methods signature and what about caching the data?

    – Lokesh Agrawal
    Nov 25 '18 at 7:51






  • 1





    Yes, if you are dealing with 1000s of records pandas dataframe is the best data structure. I

    – Shiv Rajawat
    Nov 25 '18 at 7:53











  • Create methods for all different possible combination of conditions that user can use to filter data. You can even do it in one method, but it'll require some effort from your side. For caching data, use temporary dataframes to store results, and if you are interested in only one column then use list to store results.

    – Shiv Rajawat
    Nov 25 '18 at 7:57











  • If it worked, kindly accept the answer.

    – Shiv Rajawat
    Nov 25 '18 at 14:12











  • It will work, but I am still looking for some opinions around caching of data and suggestions around generic method signature. Creating multiple methods for different conditions would be cumbersome and may not be maintainable in long run.

    – Lokesh Agrawal
    Nov 25 '18 at 14:17












Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53465311%2ffiltering-data-stored-in-csv-files%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









2














First, read the data from csv to a pandas dataframe.



import pandas as pd
# parse dates from column index 3: Purchasedate
df=pd.read_csv('file_name.csv', parse_dates=True, index_col=3)


Now if you want to select stocks purchased between two dates



print(df.loc['2010-07-01' : '2010-10-02'])


Similarly if you want data before/after a particular date, use



print(df.loc['2010-07-01' : ])   ## All records after given date
print(df.loc[ : '2010-10-02']) ## All records before given date


If you want to filter records based on some condition, simply iterate through entire dataframe and use conditional statement to filter them.



for index, row in df.iterrows():
if(row['Stock'] == 'MSFT'):
print(row)


Or



for index, row in df.iterrows():
if(row['Price'] > 100):
print(row)


There are many other ways you can achieve this. Just play around it and you'll get to know.






share|improve this answer
























  • Thanks for the answer. Is Pandas DF an optimal data structure for solving such problems in Python? What about search_portfolio methods signature and what about caching the data?

    – Lokesh Agrawal
    Nov 25 '18 at 7:51






  • 1





    Yes, if you are dealing with 1000s of records pandas dataframe is the best data structure. I

    – Shiv Rajawat
    Nov 25 '18 at 7:53











  • Create methods for all different possible combination of conditions that user can use to filter data. You can even do it in one method, but it'll require some effort from your side. For caching data, use temporary dataframes to store results, and if you are interested in only one column then use list to store results.

    – Shiv Rajawat
    Nov 25 '18 at 7:57











  • If it worked, kindly accept the answer.

    – Shiv Rajawat
    Nov 25 '18 at 14:12











  • It will work, but I am still looking for some opinions around caching of data and suggestions around generic method signature. Creating multiple methods for different conditions would be cumbersome and may not be maintainable in long run.

    – Lokesh Agrawal
    Nov 25 '18 at 14:17
















2














First, read the data from csv to a pandas dataframe.



import pandas as pd
# parse dates from column index 3: Purchasedate
df=pd.read_csv('file_name.csv', parse_dates=True, index_col=3)


Now if you want to select stocks purchased between two dates



print(df.loc['2010-07-01' : '2010-10-02'])


Similarly if you want data before/after a particular date, use



print(df.loc['2010-07-01' : ])   ## All records after given date
print(df.loc[ : '2010-10-02']) ## All records before given date


If you want to filter records based on some condition, simply iterate through entire dataframe and use conditional statement to filter them.



for index, row in df.iterrows():
if(row['Stock'] == 'MSFT'):
print(row)


Or



for index, row in df.iterrows():
if(row['Price'] > 100):
print(row)


There are many other ways you can achieve this. Just play around it and you'll get to know.






share|improve this answer
























  • Thanks for the answer. Is Pandas DF an optimal data structure for solving such problems in Python? What about search_portfolio methods signature and what about caching the data?

    – Lokesh Agrawal
    Nov 25 '18 at 7:51






  • 1





    Yes, if you are dealing with 1000s of records pandas dataframe is the best data structure. I

    – Shiv Rajawat
    Nov 25 '18 at 7:53











  • Create methods for all different possible combination of conditions that user can use to filter data. You can even do it in one method, but it'll require some effort from your side. For caching data, use temporary dataframes to store results, and if you are interested in only one column then use list to store results.

    – Shiv Rajawat
    Nov 25 '18 at 7:57











  • If it worked, kindly accept the answer.

    – Shiv Rajawat
    Nov 25 '18 at 14:12











  • It will work, but I am still looking for some opinions around caching of data and suggestions around generic method signature. Creating multiple methods for different conditions would be cumbersome and may not be maintainable in long run.

    – Lokesh Agrawal
    Nov 25 '18 at 14:17














2












2








2







First, read the data from csv to a pandas dataframe.



import pandas as pd
# parse dates from column index 3: Purchasedate
df=pd.read_csv('file_name.csv', parse_dates=True, index_col=3)


Now if you want to select stocks purchased between two dates



print(df.loc['2010-07-01' : '2010-10-02'])


Similarly if you want data before/after a particular date, use



print(df.loc['2010-07-01' : ])   ## All records after given date
print(df.loc[ : '2010-10-02']) ## All records before given date


If you want to filter records based on some condition, simply iterate through entire dataframe and use conditional statement to filter them.



for index, row in df.iterrows():
if(row['Stock'] == 'MSFT'):
print(row)


Or



for index, row in df.iterrows():
if(row['Price'] > 100):
print(row)


There are many other ways you can achieve this. Just play around it and you'll get to know.






share|improve this answer













First, read the data from csv to a pandas dataframe.



import pandas as pd
# parse dates from column index 3: Purchasedate
df=pd.read_csv('file_name.csv', parse_dates=True, index_col=3)


Now if you want to select stocks purchased between two dates



print(df.loc['2010-07-01' : '2010-10-02'])


Similarly if you want data before/after a particular date, use



print(df.loc['2010-07-01' : ])   ## All records after given date
print(df.loc[ : '2010-10-02']) ## All records before given date


If you want to filter records based on some condition, simply iterate through entire dataframe and use conditional statement to filter them.



for index, row in df.iterrows():
if(row['Stock'] == 'MSFT'):
print(row)


Or



for index, row in df.iterrows():
if(row['Price'] > 100):
print(row)


There are many other ways you can achieve this. Just play around it and you'll get to know.







share|improve this answer












share|improve this answer



share|improve this answer










answered Nov 25 '18 at 7:48









Shiv RajawatShiv Rajawat

31019




31019













  • Thanks for the answer. Is Pandas DF an optimal data structure for solving such problems in Python? What about search_portfolio methods signature and what about caching the data?

    – Lokesh Agrawal
    Nov 25 '18 at 7:51






  • 1





    Yes, if you are dealing with 1000s of records pandas dataframe is the best data structure. I

    – Shiv Rajawat
    Nov 25 '18 at 7:53











  • Create methods for all different possible combination of conditions that user can use to filter data. You can even do it in one method, but it'll require some effort from your side. For caching data, use temporary dataframes to store results, and if you are interested in only one column then use list to store results.

    – Shiv Rajawat
    Nov 25 '18 at 7:57











  • If it worked, kindly accept the answer.

    – Shiv Rajawat
    Nov 25 '18 at 14:12











  • It will work, but I am still looking for some opinions around caching of data and suggestions around generic method signature. Creating multiple methods for different conditions would be cumbersome and may not be maintainable in long run.

    – Lokesh Agrawal
    Nov 25 '18 at 14:17



















  • Thanks for the answer. Is Pandas DF an optimal data structure for solving such problems in Python? What about search_portfolio methods signature and what about caching the data?

    – Lokesh Agrawal
    Nov 25 '18 at 7:51






  • 1





    Yes, if you are dealing with 1000s of records pandas dataframe is the best data structure. I

    – Shiv Rajawat
    Nov 25 '18 at 7:53











  • Create methods for all different possible combination of conditions that user can use to filter data. You can even do it in one method, but it'll require some effort from your side. For caching data, use temporary dataframes to store results, and if you are interested in only one column then use list to store results.

    – Shiv Rajawat
    Nov 25 '18 at 7:57











  • If it worked, kindly accept the answer.

    – Shiv Rajawat
    Nov 25 '18 at 14:12











  • It will work, but I am still looking for some opinions around caching of data and suggestions around generic method signature. Creating multiple methods for different conditions would be cumbersome and may not be maintainable in long run.

    – Lokesh Agrawal
    Nov 25 '18 at 14:17

















Thanks for the answer. Is Pandas DF an optimal data structure for solving such problems in Python? What about search_portfolio methods signature and what about caching the data?

– Lokesh Agrawal
Nov 25 '18 at 7:51





Thanks for the answer. Is Pandas DF an optimal data structure for solving such problems in Python? What about search_portfolio methods signature and what about caching the data?

– Lokesh Agrawal
Nov 25 '18 at 7:51




1




1





Yes, if you are dealing with 1000s of records pandas dataframe is the best data structure. I

– Shiv Rajawat
Nov 25 '18 at 7:53





Yes, if you are dealing with 1000s of records pandas dataframe is the best data structure. I

– Shiv Rajawat
Nov 25 '18 at 7:53













Create methods for all different possible combination of conditions that user can use to filter data. You can even do it in one method, but it'll require some effort from your side. For caching data, use temporary dataframes to store results, and if you are interested in only one column then use list to store results.

– Shiv Rajawat
Nov 25 '18 at 7:57





Create methods for all different possible combination of conditions that user can use to filter data. You can even do it in one method, but it'll require some effort from your side. For caching data, use temporary dataframes to store results, and if you are interested in only one column then use list to store results.

– Shiv Rajawat
Nov 25 '18 at 7:57













If it worked, kindly accept the answer.

– Shiv Rajawat
Nov 25 '18 at 14:12





If it worked, kindly accept the answer.

– Shiv Rajawat
Nov 25 '18 at 14:12













It will work, but I am still looking for some opinions around caching of data and suggestions around generic method signature. Creating multiple methods for different conditions would be cumbersome and may not be maintainable in long run.

– Lokesh Agrawal
Nov 25 '18 at 14:17





It will work, but I am still looking for some opinions around caching of data and suggestions around generic method signature. Creating multiple methods for different conditions would be cumbersome and may not be maintainable in long run.

– Lokesh Agrawal
Nov 25 '18 at 14:17




















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53465311%2ffiltering-data-stored-in-csv-files%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







這個網誌中的熱門文章

Hercules Kyvelos

Tangent Lines Diagram Along Smooth Curve

Yusuf al-Mu'taman ibn Hud