I have a huge dataset that I need to summarize into a smaller subset of data. This is similar to averaging, but far more sophisticated. I'd like to use a distance metric (Euclidean, Chebyshev) to run through the entire dataset and create groups that represent the data. For example:
I have 5 million records, but it is not reasonable to run our application against all 5 million, so I would rather come up with 100,000 records that fairly represent the data and run my program against those. I don't want to pick records at random; I'd rather analyze the data I have and then create a subgroup. So I have N samples and I want X groups.
The idea is:
1) Settle on a distance metric you are comfortable with that compares only two samples at a time (not sequences), e.g. Euclidean or Chebyshev. The point is to compare each k-dimensional sample to the origin. This reduces every sample to a single "distance from the origin" number, so for the purposes of clustering (a.k.a. grouping) each k-dimensional sample is represented by one number. The beauty is that these numbers can be precomputed and stored with the samples.
2) To group them, select all the samples, ordered by that distance value. If you want X groups and have N total samples, start a new partition every N/X elements (for example, 5 million samples and 100,000 groups gives 50 samples per group).
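To make the idea concrete, here is a rough C++ sketch of the two steps above. It is only an illustration, not a required design: the names `Sample`, `euclideanFromOrigin`, `chebyshevFromOrigin` and `partitionByDistance` are made up for this example, and it assumes all samples fit in memory, which will not hold for the largest data sets.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical representation: one k-dimensional record.
using Sample = std::vector<double>;

// Step 1a: Euclidean distance from the origin (square root of sum of squares).
double euclideanFromOrigin(const Sample& s) {
    double sum = 0.0;
    for (double v : s) sum += v * v;
    return std::sqrt(sum);
}

// Step 1b: Chebyshev distance from the origin (largest absolute coordinate).
double chebyshevFromOrigin(const Sample& s) {
    double largest = 0.0;
    for (double v : s) largest = std::max(largest, std::fabs(v));
    return largest;
}

// Step 2: order samples by their distance value and cut the ordering into
// `groups` contiguous partitions of roughly N/X elements each.
// Returns, for each group, the indices of the samples that belong to it.
std::vector<std::vector<std::size_t>> partitionByDistance(
    const std::vector<Sample>& samples, std::size_t groups,
    double (*metric)(const Sample&) = euclideanFromOrigin) {
    const std::size_t n = samples.size();
    assert(groups > 0 && groups <= n);

    // Precompute one distance value per sample (this could be stored with the data).
    std::vector<double> key(n);
    for (std::size_t i = 0; i < n; ++i) key[i] = metric(samples[i]);

    // Sort sample indices by distance; stable_sort keeps ties in input order.
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&key](std::size_t a, std::size_t b) { return key[a] < key[b]; });

    // Group g covers positions [g*n/groups, (g+1)*n/groups) of the ordering.
    // (Assumes a 64-bit size_t so g*n does not overflow for large inputs.)
    std::vector<std::vector<std::size_t>> result(groups);
    for (std::size_t g = 0; g < groups; ++g) {
        const auto first = static_cast<std::ptrdiff_t>(g * n / groups);
        const auto last = static_cast<std::ptrdiff_t>((g + 1) * n / groups);
        result[g].assign(order.begin() + first, order.begin() + last);
    }
    return result;
}
```

Calling `partitionByDistance(samples, X)` gives back X lists of sample indices; how one representative record is then chosen or computed from each group is left open here.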
Performance is hugely important for this project. Our data sets will range from 100K to 200 million rows, and your app has to handle all of them. Ideally it can use more than one core to cut processing time, and it needs to be really smart with RAM, since we won't have 20 GB of RAM available. So we need to plan for some performance testing on a 2 GB RAM machine vs. an 8 GB one, etc.
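To give a feel for the RAM constraint: if each row is reduced to a 16-byte entry (an 8-byte distance plus an 8-byte row id), 200 million rows come to about 3.2 GB, which does not fit on a 2 GB machine. One possible way around that, sketched below purely as an assumption and not as the required design, is to sort the entries in runs that fit in RAM and then merge the sorted runs. The names `KeyedRow`, `runsNeeded` and `mergeRuns` are hypothetical, and the runs are held in memory here only to keep the example short; in practice they would live on disk or in SQL temp tables.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical entry: the precomputed distance plus the id of the source row.
struct KeyedRow {
    double key;
    std::uint64_t rowId;
};

// How many sorted runs are needed if only `ramBudgetBytes` may be used
// for entries at any one time (rounded up).
std::size_t runsNeeded(std::uint64_t rows, std::uint64_t ramBudgetBytes) {
    const std::uint64_t perRun = ramBudgetBytes / sizeof(KeyedRow);
    assert(perRun > 0);
    return static_cast<std::size_t>((rows + perRun - 1) / perRun);
}

// K-way merge of runs that are each already sorted by key.
// A real implementation would stream the output instead of collecting it.
std::vector<KeyedRow> mergeRuns(const std::vector<std::vector<KeyedRow>>& runs) {
    struct Cursor {
        double key;       // key of the entry the cursor points at
        std::size_t run;  // which run it belongs to
        std::size_t pos;  // position inside that run
    };
    // Min-heap on key: the cursor with the smallest key is on top.
    auto later = [](const Cursor& a, const Cursor& b) { return a.key > b.key; };
    std::priority_queue<Cursor, std::vector<Cursor>, decltype(later)> heap(later);

    for (std::size_t r = 0; r < runs.size(); ++r) {
        if (!runs[r].empty()) heap.push(Cursor{runs[r][0].key, r, 0});
    }

    std::vector<KeyedRow> merged;
    while (!heap.empty()) {
        const Cursor c = heap.top();
        heap.pop();
        merged.push_back(runs[c.run][c.pos]);
        // Advance the cursor within its run, if anything is left there.
        if (c.pos + 1 < runs[c.run].size()) {
            heap.push(Cursor{runs[c.run][c.pos + 1].key, c.run, c.pos + 1});
        }
    }
    return merged;
}
```

Because the merged output is already in distance order, the N/X partition boundaries can be cut while streaming through it, so only one entry per run needs to be in memory at a time. Sorting the individual runs is also a natural place to use more than one core.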
I need really well-documented code for this, and please allow plenty of room for testing in your estimate.
## Deliverables
1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.
2) Deliverables must be in ready-to-run condition, as follows (depending on the nature of the deliverables):
a) For web sites or other server-side deliverables intended to only ever exist in one place in the Buyer's environment--Deliverables must be installed by the Seller in ready-to-run condition in the Buyer's environment.
b) For all others including desktop software or software the buyer intends to distribute: A software installation package that will install the software in ready-to-run condition on the platform(s) specified in this bid request.
3) All deliverables will be considered "work made for hire" under U.S. Copyright law. Buyer will receive exclusive and complete copyrights to all work purchased. (No GPL, GNU, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site per the coder's Seller Legal Agreement).
## Platform
C++, C#, SQL