Want to build a data business? Here’s the 7 ways to collect data

So you wanna build a data business?

One of the things I didn’t understand when starting CB Insights was the many ways you can collect and create the information that underpins a data business.

As we’ve built CBI and as I’ve studied others, I’ve realized there are 7 primary strategies employed for data collection & creation (herein “data collection”).

I’ll describe each below with some examples. Notes:

Most data companies that have been around for a while employ multiple of these tactics although they typically start with 1, maybe 2. For example, CB Insights uses 5 of the 7 data collection methods now but we started with just 1.

Some of these data collection methods are inherently more valuable than others. I don’t discuss that below but may if this niche’y topic is of interest to folks. If you’re building a data business (or interested in this space), hope this is helpful. For other data business builders, keep me honest in the comments if I’ve missed anything.

The 7 data collection methods

1. Ground & pound

2. “Web scraping”

3. UGC

4. Pooled / data consortiums

5. Surveys & Interviews

6. Sawdust collection

7. God algorithms

———————–

1. Ground & Pound

————————

Directly collecting data through manual effort, often involving significant human resources.

This is the way a lot of people start.

Data businesses often require a high pain tolerance in the beginning. This is good and bad. Good in that not everyone likes doing data janitor work. Bad in that you’re doing data janitor work.

Examples

ZoomInfo: Initially collected contact information by cold-calling companies.

Waze: Initially relied on drivers to manually report traffic conditions, accidents, and hazards.

CB Insights: We initially extracted company and transaction information from reading and manually parsing 50,000 articles. Once we saw patterns there, it informed the machine learning capabilities we eventually built.

———————–

2. “Web scraping”

————————

Aggregating, linking, and harmonizing data from various public and third-party sources, often unstructured or semi-structured.

This often involves web crawling, machine learning, etc.

I say “web scraping” in quotes because people seem to think this is easy (“Oh, so you have bots?”). Extracting data from semi-structured and unstructured documents in a high-fidelity way is not easy to do.

Examples

Meltwater: Collects data from online news and social media for media intelligence that they sell to comms and PR teams.

CoStar: Aggregates data from public records, such as property tax records, deeds, and zoning information, to enhance their real estate database.

Bloomberg: Aggregates financial data from multiple sources and provides harmonized datasets.
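To make the “this is not easy” point concrete, here’s a minimal sketch of pulling a structured funding record out of one sentence of news text. The pattern, the field names, and the `extract_funding` helper are all hypothetical and only for illustration; this is not how any of the companies above actually do it.

```python
import re
from typing import Optional

# Hypothetical pattern for sentences like "Acme raised $12.5M in Series B funding".
# It is deliberately naive: real articles phrase this in far more ways.
FUNDING_RE = re.compile(
    r"(?P<company>[A-Z][\w&.]*(?: [A-Z][\w&.]*)*) (?:raised|raises|secured) "
    r"\$(?P<amount>\d+(?:\.\d+)?)\s?(?P<unit>[MB]|million|billion)"
    r"(?: in (?:a )?(?P<round>Series [A-Z]|seed))?"
)

UNIT_MULTIPLIER = {"M": 1e6, "million": 1e6, "B": 1e9, "billion": 1e9}


def extract_funding(sentence: str) -> Optional[dict]:
    """Return a structured record for one sentence, or None if nothing matches."""
    match = FUNDING_RE.search(sentence)
    if match is None:
        return None
    return {
        "company": match.group("company"),
        "amount_usd": float(match.group("amount")) * UNIT_MULTIPLIER[match.group("unit")],
        "round": match.group("round"),  # None when the round isn't stated
    }


print(extract_funding("Acme Robotics raised $12.5M in Series B funding"))
print(extract_funding("Funding of $12.5M was raised by Acme Robotics"))
```

The first sentence parses cleanly. The second, which says the same thing in the passive voice, returns `None`, and a sentence like “Yesterday Acme raised $3 million” comes back with the company as “Yesterday Acme”. Multiply that by every way a journalist can phrase a deal and you get a sense of why high-fidelity extraction takes more than “bots”.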

———-

3. UGC

———-

Users contribute content. These are often lead gen or advertising revenue models vs direct data subscription businesses.

This is the model most at risk from generative AI (GAI) and the resulting changes to Google.

Examples

Yelp

Glassdoor

————————————

4. Pooled / Data Consortium

————————————-

Also referred to as Data Co-operatives or Data Co-Ops, pooled data typically entails collecting data from multiple entities within an industry to gain broader insights and more comprehensive datasets.

Typically, contributors of the data are provided back anonymized evaluation benchmarks and analytics that help them operate their businesses better.
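As a rough sketch of that give-to-get loop: each member contributes one number, and gets back where it stands against the pool without seeing anyone else’s value. The `benchmark_report` function, the field names, and the minimum pool size of 5 are assumptions for illustration, not any specific consortium’s rules.

```python
from statistics import median
from typing import Dict

# Assumed threshold: with too few contributors, a "benchmark" could reveal
# an individual member's value, so refuse to report below this size.
MIN_CONTRIBUTORS = 5


def benchmark_report(pool: Dict[str, float], member: str) -> dict:
    """Return an anonymized benchmark of a pooled metric for one contributor."""
    if member not in pool:
        raise KeyError(f"{member} has not contributed data")
    if len(pool) < MIN_CONTRIBUTORS:
        raise ValueError("pool too small to anonymize")
    values = sorted(pool.values())
    own = pool[member]
    # Share of contributors whose value is strictly below this member's.
    below = sum(1 for v in values if v < own)
    return {
        "your_value": own,
        "pool_median": median(values),
        "percentile": round(100 * below / len(values)),
        "contributors": len(values),
    }


pool = {"a": 10, "b": 20, "c": 30, "d": 40, "e": 50}
print(benchmark_report(pool, "d"))
```

The report only contains the member’s own value plus aggregates, which is the basic bargain of a data co-op: you hand over your data and get back context you couldn’t produce alone.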

Examples

Equifax: Part of a consortium of credit bureaus sharing data for comprehensive credit reporting.

Verisk: Collects insurance industry data.

Raiser’s Edge NXT: Pooling fundraising and donor data across nonprofits for better insights.

Pave & Payscale: Gather salary information from employers and employees to create a comprehensive database for compensation analysis.

IQVIA (formerly IMS Health & Quintiles): Collects data directly from clinical trials and research studies conducted worldwide.

——————————

5. Surveys & Interviews

——————————-

Collecting data through structured surveys, interviews, and questionnaires.

Examples

Nielsen: Uses TV diaries, online surveys, and in-person interviews to gather data on viewing habits and consumer preferences.

JD Power: Known for its customer satisfaction and product quality surveys in various industries including automotive, healthcare, and finance.

Gallup: Known for its public opinion polls and surveys, which cover a wide range of topics including politics, the economy, and social issues.

CB Insights: We interview software buyers, make those transcripts available, and extract structured data from those conversations on metrics like pricing, CSAT, etc.

—————————-

6. Sawdust Collection

—————————-

Creating valuable data as a byproduct of a company’s primary operations or services.

Examples

Google: Generates data from search queries and user behavior on its platforms.

Slice: A shopping app that organized the order confirmations and shipping notifications that retailers e-mail to consumers after purchase. Slice mined this for insights into purchasing trends they could sell to marketers, investors, etc. Slice was acquired by Rakuten.

John Deere: Collects agricultural data from sensors on farming equipment. For example, harvesting equipment has sensors that measure crop yield, moisture levels, and operational efficiency, which can be used to analyze field performance and improve future yields.

LinkedIn: Insights on where people being hired are coming from, what roles they are hiring in, growth trends, etc. come from the updates people make to their core individual profiles.

Yodlee/Envestnet: Data on financial transactions, e.g. credit card purchases. While they don’t provide directly identifiable information, they do provide aggregate data that can help analysts understand how Uber’s or Instacart’s revenue might be looking.
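The transaction example can be sketched in a few lines: roll individual purchases up to merchant-by-month totals and drop any cell backed by too few people. The record fields, the `monthly_spend` function, and the 3-cardholder threshold are assumptions for illustration, not a description of any vendor’s actual pipeline.

```python
from collections import defaultdict

# Assumed privacy threshold: suppress any merchant-month with fewer
# distinct cardholders than this, so no individual can be singled out.
MIN_CARDHOLDERS = 3


def monthly_spend(transactions, min_cardholders=MIN_CARDHOLDERS):
    """Aggregate raw transactions into {(merchant, 'YYYY-MM'): total spend}."""
    totals = defaultdict(float)
    holders = defaultdict(set)
    for t in transactions:
        key = (t["merchant"], t["date"][:7])  # 'YYYY-MM-DD' -> 'YYYY-MM'
        totals[key] += t["amount"]
        holders[key].add(t["cardholder_id"])
    return {
        key: round(total, 2)
        for key, total in totals.items()
        if len(holders[key]) >= min_cardholders
    }


transactions = [
    {"merchant": "Uber", "date": "2024-05-02", "amount": 20.0, "cardholder_id": 1},
    {"merchant": "Uber", "date": "2024-05-09", "amount": 15.5, "cardholder_id": 2},
    {"merchant": "Uber", "date": "2024-05-20", "amount": 30.0, "cardholder_id": 3},
    {"merchant": "Instacart", "date": "2024-05-11", "amount": 80.0, "cardholder_id": 1},
]
print(monthly_spend(transactions))
```

Only the Uber cell survives here because the Instacart cell has a single cardholder behind it. The output has no person-level rows at all, which is what makes the byproduct sellable as a trend signal.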

———————–

7. God Algorithms

———————–

Using advanced algorithms to create valuable data products from existing data.

Typically requires market credibility/blessing before introduction.

Examples

Zillow: Uses the Zestimate algorithm to estimate property values.

FICO: Generates credit scores based on proprietary algorithms analyzing financial data.

CB Insights: Our private company health score (Mosaic) and Commercial Maturity Score are created using algorithms.

So those are the 7 ways to collect data upon which to build a data business. However, just because you can collect data doesn’t mean you have a potential business. You need to understand if that data is valuable. If there is interest, I’ll dig into what makes data valuable in a separate post.


Discover more from Anand Sanwal

Subscribe to get the latest posts sent to your email.

4 responses to “Want to build a data business? Here’s the 7 ways to collect data”

  1. This is an insightful list of how data businesses acquire their data. I am interested to hear your thoughts on how to gauge the value of the collected data.

  2. This is a great list. I am curious to know the techniques some companies use to sell or commercialize the data.

  3. […] covered the 7 methods to acquire data for a data business here with examples of companies doing […]

  4. @Anand how did you get your first 50 customers after doing your ECO and validating the idea?
