In recent years, data availability limitations as an analytical obstacle have been replaced by data superabundance and dataset incompatibilities. Dealing effectively with this new fundamental problem has become a major task in much of our analytical work.
Builders of data warehouses deal with this issue daily as new data sources are integrated into existing architectures and definitions. Analysts are faced with the same problem but typically only have to solve it for a current purpose. Our data hell is usually less painful but still hell.
Our commercial demographics dataset had an establishment ID, name, zip code, and SIC code set as primary identifiers. The company's sales dataset had customer number, ship-to number, name, and zip code as primary identifiers. Linking these up should have been a piece of cake, except for a few problems like these:
1. Demographics data names matched to company sales data names only about half the time.
2. Demographics data zip codes were often the headquarters location, whereas sales data zip codes were commonly ship-to locations.
3. Demographics data for multi-location businesses in some cases was the same for each location, not broken down. This appeared to depend on whether the company reported employee count by location, since employees were an obvious factor in making market potential projections.
4. Sales data names often contained helpful notes such as "***DO NOT USE***" or references to a master account.
5. Sales data names often were subsidiary or division names while demographics data names were corporate names.
6. Abbreviations were common and in great variety. Ditto for typos. In both datasets.
This should give you enough flavor without going through the full list. When you are trying to match around 100,000 company sales names with 10 million demographics names, these problems must be handled almost entirely by machine.
The good news here is that it is nearly always possible to overcome such data difficulties. It just takes a lot of work, patience, and experience. If your company is thinking of developing a sales territory potential metric, then you should be aware of the magnitude of this common front-end task.
When you have linked your demographic data to your customer accounts, you are ready to develop initial estimates of account sales potential. There are a couple of tricks you need to know about in doing this ... more ...
Often, elegant approaches are not available and brute force is the only option. This is not a purely manual effort but combines ad hoc algorithms with visual checking.
The algorithms are mostly quick-and-dirty types generated on the fly to handle a particular type of data problem. Visual inspection is used to develop and check the machine result.
As each algorithm is built and tested, it is added to the machine analytical framework. Eventually, you get to a point at which there are only a manageable number of data problems that must be handled manually and individually.
As you can readily see, this is not a fast process. The good news is that, once you have matched your datasets, you do not have to do it again. It is a one-time effort and only needs to be updated for new accounts and changes.
People building and maintaining data warehouses will struggle with such problems almost forever.