Description of Data
The data provided describes technology companies, organized by year and by year founded, with location (city, state, zip), sales, employment, primary industry, and product types. The industry and product type classifications come from the North American Industry Classification System (NAICS). The data is currently a subset of a database used for research at UMass Lowell. In addition, a comma-delimited data file (zipdata.csv) containing 5-digit zip codes and latitude/longitude coordinates is provided.
Detailed NAICS information can be found at http://www.census.gov/epcd/www/naics.html
Each CompanyDataXX file (one per year XX, covering the 15 years from 1989 to 2003) includes the following information for individual high-tech companies in the United States.
2 Company ID (number)
3-5 Address (city, state and zip)
6 Industry type (chemicals, energy, medical, software, etc.) - code description is in the file IndustryCodes
7 Year formed (founded)
8 Primary NAICS (government company classification code)
9 Sales in Millions
10 Employment Count
The ProductDataXX files give the product classifications for each company in each year. These files may contain more records than Excel can open.
2 Company ID (same as above)
3 NAICS product code
4 Product Verbal Description (providing more details on the product than the NAICS code)
There are 87,659 companies in the complete data set. For example, about 60,000 companies are included for 2003.
There is one company data entry for each company for each year.
There are multiple product data entries for each company for each year because companies typically produce multiple products.
The company data and product data can be related through the combination of company ID and year.
There is missing data.
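The relation between company and product records can be sketched as follows. This is a minimal illustration, not the layout of the actual files: the field names (`company_id`, `year`, `naics_product`, and so on) and the sample rows are assumptions made for the example, and the NAICS values are placeholders.

```python
from collections import defaultdict

# Hypothetical rows; the real CompanyDataXX / ProductDataXX layout may differ.
company_rows = [
    {"year": 2003, "company_id": 101, "state": "MA", "sales_musd": 12.5, "employees": 80},
    {"year": 2003, "company_id": 102, "state": "CA", "sales_musd": None, "employees": 15},
]
product_rows = [
    {"year": 2003, "company_id": 101, "naics_product": "334510", "description": "product A"},
    {"year": 2003, "company_id": 101, "naics_product": "339112", "description": "product B"},
    {"year": 2002, "company_id": 101, "naics_product": "334510", "description": "product A"},
]


def join_company_products(companies, products):
    """Attach to each company-year record its products, keyed on (company_id, year)."""
    by_key = defaultdict(list)
    for p in products:
        by_key[(p["company_id"], p["year"])].append(p)

    joined = []
    for c in companies:
        key = (c["company_id"], c["year"])
        # Companies without product rows (missing data) get an empty list.
        joined.append({**c, "products": by_key.get(key, [])})
    return joined


joined = join_company_products(company_rows, product_rows)
for row in joined:
    print(row["company_id"], row["year"], len(row["products"]))
```

Keying on the pair rather than on the company ID alone keeps products from different years apart, and a company with no product rows still appears in the result.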
The web site for the full NAICS codes is http://www.census.gov/epcd/naics02/naicod02.htm
Companies in each NAICS code can be searched in the data by region and year. Remember, government sources do not supply company-specific information; they supply only totals, and only for geographical areas with enough companies that specific companies cannot be identified.
The data contains more than this. Each NAICS code can identify the companies for which that code is the primary code or primary industry and, in addition, the same NAICS code can identify the companies that make a product fitting that code.
The zip code data is in zipdata.csv, a comma-delimited file with the following structure:
1. Zip code
1. The database is searched for companies in which MED is the ‘Primary Industry’. Spreadsheets are created which include all such companies by region, state, zip code, or city.
2. The companies on the spreadsheet are sorted by company NAICS code. This gives a distribution of companies by the whole range of NAICS codes, including both manufacturing and various services.
3. Step 1 is repeated, but this time the data is searched for companies that have MED products. Again, the companies on the spreadsheet are sorted by NAICS code. This gives a much larger group of companies, not only in the services sectors but in manufacturing sectors as well.
4. Regions can be compared in terms of the different arrays of NAICS codes that make up MED, or any other technology group or even any major product code.
5. The same region can be compared over time.
6. Fast growing regions by technology groups can be found and compositions compared over time.
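Steps 1, 2, and 4 above can be sketched in a few lines. This is only an illustration under stated assumptions: the field names, the sample rows, the NAICS values, and the `SOF` industry code are hypothetical. Only `MED` comes from the description above.

```python
from collections import Counter, defaultdict

# Hypothetical company records; field names and values are illustrative only.
companies = [
    {"company_id": 1, "year": 2003, "state": "MA", "industry": "MED", "naics": "334510"},
    {"company_id": 2, "year": 2003, "state": "MA", "industry": "MED", "naics": "541710"},
    {"company_id": 3, "year": 2003, "state": "CA", "industry": "MED", "naics": "334510"},
    {"company_id": 4, "year": 2003, "state": "CA", "industry": "SOF", "naics": "511210"},
    {"company_id": 5, "year": 2002, "state": "MA", "industry": "MED", "naics": "334510"},
]


def naics_distribution(rows, industry, year, region_field="state"):
    """Count companies per primary NAICS code, per region, for one industry and year."""
    dist = defaultdict(Counter)
    for r in rows:
        if r["industry"] == industry and r["year"] == year:
            dist[r[region_field]][r["naics"]] += 1
    return dist


dist = naics_distribution(companies, "MED", 2003)
for region in sorted(dist):
    # Sorting by NAICS code mirrors step 2; listing per region supports step 4.
    print(region, sorted(dist[region].items()))
```

Calling the same function with a different `year` gives the comparison over time described in step 5, and passing `region_field="zip"` or `"city"` would change the regional grouping, assuming those fields exist in the records.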
1. Characterize correlations or other patterns among two or more variables in the data.
What products lead to growth in other products or industries?
What contributes to companies moving, and what characterizes the moves?
2. Characterize clusters of products, industries, sales, regions, and/or companies.
What geographical areas developed in a similar manner or have similar characteristics?
What product combinations tend to be produced by a company, or in a region?
3. Characterize unusual products, sales, regions, or companies.
Are there regions whose product mix changes in an unusual direction?
Are there products whose sales per employee varies geographically?
4. Characterize any other trend, pattern, or structure that may be of interest.
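As one possible starting point for questions such as geographic variation in sales per employee, the ratio can be aggregated by region while skipping incomplete records. This is a sketch under assumptions: the field names and sample values are hypothetical, and it works at the company level. Restricting it to a single product would first require relating company and product records by ID and year.

```python
from collections import defaultdict

# Hypothetical company records for one year; sales are in millions.
rows = [
    {"state": "MA", "sales_musd": 10.0, "employees": 50},
    {"state": "MA", "sales_musd": 30.0, "employees": 150},
    {"state": "CA", "sales_musd": 8.0, "employees": 20},
    {"state": "CA", "sales_musd": None, "employees": 40},  # missing sales
    {"state": "TX", "sales_musd": 5.0, "employees": 0},    # unusable employment count
]


def sales_per_employee_by_region(rows, region_field="state"):
    """Return total sales divided by total employment for each region."""
    totals = defaultdict(lambda: [0.0, 0])  # region -> [sales, employees]
    for r in rows:
        sales, emp = r["sales_musd"], r["employees"]
        if sales is None or not emp:
            continue  # skip records with missing sales or zero/missing employment
        totals[r[region_field]][0] += sales
        totals[r[region_field]][1] += emp
    return {region: s / e for region, (s, e) in totals.items()}


ratios = sales_per_employee_by_region(rows)
for region in sorted(ratios):
    print(region, round(ratios[region], 3))
```

Dividing regional totals weights large companies more heavily than averaging per-company ratios would. Either choice is defensible, and the two can give different pictures of the same region.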
Note: unfortunately, the economic data was made available only for the 2005 contest, and we are not allowed to post it in this repository. Nevertheless, we decided to post the results of the contest. Researchers interested in using the data for economic research should contact Michael Best at Michael_Best@uml.edu.
Send any additional questions to: