If you are building a machine learning solution, there’s a good chance you will need to build your own training data sets.
“To properly train a predictive model, historical data must meet exceptionally broad and high quality standards. First, the data must be right: It must be correct, properly labeled, de-duped, and so forth.” – Harvard Business Review
Whether you are a university research team, a startup, or even a big player, the information below could help you make a better decision and build a better data set.
1. Use public training data sets for your machine learning project.
Some data sets have already been built and released for public use. For example, if you are looking for semantic segmentation data, Cityscapes could be a good choice. Search for these data sets first and see if they match your needs before spending money and effort on the other options. Keep in mind, though, that public data sets are likely to help only with simple use cases or development purposes.
2. Build the training data in-house: Build your team, do it yourself.
Although this is an option, in most cases your team would prefer not to choose it. It is likely to mean spending time and effort on tasks that are not the right ones for your team.
You will need HR support to recruit annotators, filter for the right candidates, train them on the rules, provide office desks for them, and lay them off when the work is done. That is quite a headache for HR. In the end, it could be even more expensive than outsourcing to a specialized vendor.
A better use of time is to focus on what your team does best, such as tuning algorithms or developing the system.
3. Use a crowd-sourcing platform: A popular option.
Speed is what most people think of when this option comes up.
However, accuracy can be a weakness of the crowd-sourcing model: members can be located anywhere, which makes quality and consistency difficult to control. If your project doesn’t require highly accurate training data, you can simply pick a few platforms, compare prices, and see which one fits.
The cost of crowd-sourcing is expected to be lower than hiring a dedicated contractor, but that is not always the case. It doesn’t hurt to also request quotes from a few dedicated contractors for your own evaluation.
4. Hire a contractor who provides annotation (or BPO) services.
If you need accurate results, I would suggest hiring a contractor that has a dedicated team and a dedicated project manager.
The first key point in managing quality is understanding the annotation rules, which vary from project to project. A single sentence written in the rules can be interpreted in different ways by different annotators, and this leads to inconsistent annotation results.
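One common way to put a number on this kind of inconsistency is to have two annotators label the same small set of items and compute an agreement score. The sketch below uses Cohen's kappa, which is one possible metric and not something any particular vendor is known to use; the function name `cohen_kappa` and the sample labels are made up for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators reading the same rule document.
annotator_1 = ["car", "car", "truck", "car", "bus", "truck"]
annotator_2 = ["car", "truck", "truck", "car", "bus", "car"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

A value close to 1 suggests the rules are being read the same way, while a low value is a hint that a rule needs to be rewritten or explained again before annotation continues.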
The second key point is quality control throughout the annotation process, not only at the end of the cycle, because a problem found late has a huge impact. For example, if one rule is misunderstood and you don’t find out until the end, millions of photos could be affected, and reviewing and fixing all of them is not feasible.
In contrast, with a dedicated team you can stay in close communication and make sure every rule is understood. If mistakes occur, you can give feedback to the team immediately and get them fixed.
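As a rough illustration of checking quality during the process instead of at the end, the sketch below reviews a random sample from each delivered batch and stops as soon as a batch fails. Everything in it is an assumption for the example: the function names, the sample size, the 5% error threshold, and the `reviewer` callback, which stands in for a human check of one item.

```python
import random


def spot_check_batch(batch, reviewer, sample_size=20, max_error_rate=0.05, seed=None):
    """Review a random sample of one batch; return (passed, error_rate)."""
    rng = random.Random(seed)
    sample = rng.sample(batch, min(sample_size, len(batch)))
    if not sample:
        return True, 0.0
    # reviewer(item) returns True when the annotation follows the rules.
    errors = sum(1 for item in sample if not reviewer(item))
    error_rate = errors / len(sample)
    return error_rate <= max_error_rate, error_rate


def run_with_quality_gate(batches, reviewer, **kwargs):
    """Accept batches in order and stop at the first one that fails review.

    Returns (accepted_items, failed_batch_index, error_rate_of_failed_batch).
    failed_batch_index is None when every batch passes.
    """
    accepted = []
    for index, batch in enumerate(batches):
        passed, error_rate = spot_check_batch(batch, reviewer, **kwargs)
        if not passed:
            return accepted, index, error_rate
        accepted.extend(batch)
    return accepted, None, 0.0


# Hypothetical data: the second batch was annotated with a misunderstood rule.
batches = [
    [{"id": i, "correct": True} for i in range(100)],
    [{"id": i, "correct": False} for i in range(100, 200)],
    [{"id": i, "correct": True} for i in range(200, 300)],
]
accepted, failed_index, failed_rate = run_with_quality_gate(
    batches, reviewer=lambda item: item["correct"], seed=0
)
print(len(accepted), failed_index, failed_rate)
```

With a gate like this, a misunderstood rule costs one batch of rework, not the whole data set.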
Conclusions
From what I see, price and expected accuracy are the key factors that decide which option and which vendor you should go with.
So, don’t hesitate to list some vendors, request quotes, ask for a small pilot if needed, and then make your decision.
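If you do run a pilot, one simple way to compare the results is to score each vendor against a small set of items you have labeled yourself, then pick the cheapest vendor that meets your accuracy requirement. The sketch below assumes exactly that decision rule; the vendor names, prices, and labels are invented for the example.

```python
def pilot_accuracy(vendor_labels, gold_labels):
    """Fraction of pilot items where the vendor matches your own labels."""
    if len(vendor_labels) != len(gold_labels) or not gold_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(v == g for v, g in zip(vendor_labels, gold_labels))
    return matches / len(gold_labels)


def choose_vendor(pilots, gold_labels, required_accuracy):
    """Return (name, price_per_item, accuracy) of the cheapest qualifying vendor.

    pilots maps a vendor name to (price_per_item, labels_from_pilot).
    Returns None when no vendor reaches required_accuracy.
    """
    qualified = []
    for name, (price_per_item, labels) in pilots.items():
        accuracy = pilot_accuracy(labels, gold_labels)
        if accuracy >= required_accuracy:
            qualified.append((price_per_item, name, accuracy))
    if not qualified:
        return None
    price, name, accuracy = min(qualified)  # lowest price wins
    return name, price, accuracy


# Hypothetical pilot: 10 items labeled in-house, then sent to three vendors.
gold = ["cat", "dog", "dog", "cat", "bird", "cat", "dog", "bird", "cat", "dog"]
pilots = {
    "crowd_platform": (0.03, ["cat", "cat", "dog", "cat", "cat", "cat", "dog", "bird", "cat", "dog"]),
    "contractor_a": (0.06, list(gold)),
    "contractor_b": (0.08, ["cat", "dog", "dog", "cat", "bird", "cat", "dog", "dog", "cat", "dog"]),
}
print(choose_vendor(pilots, gold, required_accuracy=0.9))
print(choose_vendor(pilots, gold, required_accuracy=0.8))
```

Raising or lowering `required_accuracy` changes which option wins, which is the price versus accuracy trade-off in miniature.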