Introduction

This project asks a single question: what county-level demographic and economic characteristics best predict the rate of new business formation across the United States, and how accurately can we predict a county’s startup activity from its profile alone?

Small business formation is widely treated as a leading indicator of regional economic vitality. Counties that consistently generate new firms tend to see stronger long-term job creation, wage growth, and economic mobility (Glaeser, Kerr, Carlino). For policymakers, predicting where new businesses are likely to form matters for several practical reasons. A federal agency such as the Small Business Administration (SBA) can target technical-assistance grants toward counties that look poised to grow. A state economic-development office can identify counties under-performing relative to their demographic profile and prioritize them for workforce programs. A journalist or academic researcher can ask, “after controlling for income and education, do places with more foreign-born residents really see more new businesses?” The models built in this project support all three of those uses.

Target variable

The target is the county-level firm startup rate, defined as new establishments per 1,000 residents in 2023. It is computed from the U.S. Census Bureau’s Business Dynamics Statistics (BDS) county-level release as estabs_entry / total_population × 1,000. After cleaning, the modeling sample contains 3,069 U.S. counties; the rate ranges from 0 to 15, with a mean of approximately 2 startups per 1,000 residents and a right-skewed distribution.
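
As a minimal sketch of the rate calculation (written in Python for illustration; the function name and the example county are hypothetical and not taken from the project's pipeline):

```python
def startup_rate_per_1000(estabs_entry, total_population):
    """New establishments per 1,000 residents."""
    if total_population <= 0:
        raise ValueError("total_population must be positive")
    return estabs_entry / total_population * 1000.0


# Made-up county: 120 entering establishments, 60,000 residents.
example_rate = startup_rate_per_1000(120, 60_000)
print(example_rate)
```

A county with 120 entering establishments and 60,000 residents lands at 2 per 1,000, which is close to the sample mean quoted above.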

Predictors

Six predictors, all drawn from the U.S. Census American Community Survey (ACS) 5-Year Estimates, 2019–2023, were chosen because each represents a hypothesized channel through which county-level demographics might shape entrepreneurial activity:

  • Median household income. A coarse proxy for purchasing power and access to capital. Higher-income counties may have more residents able to fund a new venture.
  • Share of adults aged 25 and over with a bachelor’s degree or higher. A standard human-capital measure. Knowledge-intensive entrepreneurship tends to cluster in well-educated areas.
  • Share of population that is foreign-born. Immigrant populations are documented to start firms at higher rates than native-born populations, often in services and retail.
  • Civilian labor force participation rate. A measure of economic engagement; counties where more working-age residents are economically active may sustain more new businesses.
  • Mean commute time in minutes. A proxy for urban density and labor-market dynamism: longer commutes typically reflect denser metropolitan structure, which correlates with entrepreneurial agglomeration.
  • Share of population aged 25 to 44. The prime entrepreneurship age cohort; younger adults are over-represented among new founders.

These six variables collectively span economic capacity, human capital, demographic composition, and structural dynamism. Their relative importance and any nonlinear interactions are what the modeling work in the following sections is designed to surface.

How the model is used

Given a county’s six predictor values, each fitted model returns a predicted startup rate. The intended use is cross-sectional inference: the model never sees BDS data at prediction time. It only sees ACS demographics, then estimates how entrepreneurially active that county is likely to be. A user can:

  1. Plug in the actual ACS values for a county and compare the prediction to that county’s observed BDS rate; counties whose actual rate falls noticeably above the prediction are over-performing given their demographics, and vice versa.
  2. Plug in hypothetical demographics (e.g., a county whose education share is shifted by ten percentage points) to read off the model’s estimated effect on the startup rate, holding everything else equal.
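
The two uses above can be sketched as follows. This is an illustrative Python outline, not the project's code: `toy_predict` is a hypothetical stand-in for a fitted model, its coefficients are invented, and the function names are assumptions.

```python
def performance_gap(actual_rate, predicted_rate):
    """Positive gap: the county out-performs its demographic profile."""
    return actual_rate - predicted_rate


def counterfactual_effect(predict, profile, feature, delta):
    """Change in the predicted rate when one feature shifts by delta, others fixed."""
    shifted = dict(profile)
    shifted[feature] = profile[feature] + delta
    return predict(shifted) - predict(profile)


# Hypothetical stand-in for a fitted model (NOT one of the report's models).
def toy_predict(profile):
    return 0.5 + 0.05 * profile["pct_bachelors_or_higher"]


county = {"pct_bachelors_or_higher": 25.0}
gap = performance_gap(actual_rate=2.3, predicted_rate=toy_predict(county))
effect = counterfactual_effect(toy_predict, county, "pct_bachelors_or_higher", 10.0)
```

With a real nonlinear model the counterfactual effect depends on the starting profile, so the same ten-point shift can give different answers for different counties.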

Data Source

The modeling table for this project was constructed from two primary federal sources, both pulled directly from the U.S. Census Bureau and joined on the standard 5-digit FIPS county code.

Target variable

The target variable, county-level firm formation, comes from the Business Dynamics Statistics (BDS) county time series, 2023 release (U.S. Census Bureau, 2024a). The BDS is constructed from the Longitudinal Business Database, which links Census Bureau employment records over time to track every U.S. private-sector employer establishment from its first appearance through any name changes, ownership transfers, and eventual exit. The estabs_entry field used here counts establishments that were not present in the prior year and have positive employment in the current year. Data are released annually with roughly a two-year lag: the 2023 release was published in late 2024 and contains data through fiscal year 2023. I downloaded the county time-series CSV directly from www2.census.gov/programs-surveys/bds/tables/time-series/2023/bds2023_st_cty.csv.

Predictors

The six predictor variables come from the American Community Survey (ACS) 5-Year Estimates, 2019 to 2023, county level (U.S. Census Bureau, 2024b). The ACS is a continuous monthly household survey that samples about 3.5 million addresses per year, with the 5-year file pooling sixty months of responses to deliver county-level estimates for even the smallest counties. Five-year estimates are the only ACS product available for all 3,000-plus U.S. counties; the more current 1-year estimates only cover counties with population above 65,000. The variables I used are derived from standard B-tables: B19013 (median household income), B15003 (educational attainment for the population aged 25 and over), B05002 (place of birth by nativity), B23025 (employment status for the population aged 16 and over), B08013 with B08303 (aggregate and total travel time to work, used to compute mean commute), and B01001 (sex by age, used to derive the share of population aged 25 to 44). All ACS variables were pulled via the Census Data API at api.census.gov/data/2023/acs/acs5.
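
A rough sketch of how the six predictors could be derived from raw ACS counts is shown below, in Python for illustration. The keys of `raw` are hypothetical names rather than ACS variable codes, and expressing shares on a 0 to 100 scale is an assumption; only the six output names follow the predictor names used elsewhere in this report.

```python
def derive_predictors(raw):
    """Turn raw ACS counts for one county into the six model predictors.

    The keys of `raw` are hypothetical names, not ACS variable codes.
    """
    def pct(part, whole):
        return 100.0 * part / whole  # assumption: shares on a 0-100 scale

    return {
        "median_income": raw["median_household_income"],
        "pct_bachelors_or_higher": pct(raw["bachelors_or_higher"], raw["pop_25_plus"]),
        "pct_foreign_born": pct(raw["foreign_born"], raw["total_population"]),
        "lfp_rate": pct(raw["in_labor_force"], raw["pop_16_plus"]),
        # Mean commute = aggregate minutes travelled / number of commuting workers.
        "mean_commute": raw["aggregate_travel_minutes"] / raw["commuting_workers"],
        "pct_25_44": pct(raw["pop_25_44"], raw["total_population"]),
    }


# Made-up counts for a single county, for illustration only.
predictors = derive_predictors({
    "median_household_income": 55000,
    "bachelors_or_higher": 10000, "pop_25_plus": 40000,
    "foreign_born": 3000, "total_population": 60000,
    "in_labor_force": 30000, "pop_16_plus": 48000,
    "aggregate_travel_minutes": 625000, "commuting_workers": 25000,
    "pop_25_44": 15000,
})
```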

Acquisition pipeline

The end-to-end pipeline lives in the companion 01_data_pull.Rmd. It downloads the BDS county CSV directly from www2.census.gov, calls the Census Data API for the ACS variables, joins the two sources on FIPS code, computes the target rate (estabs_entry / total_population × 1,000) along with four derived percentage predictors, filters out counties with population below 1,000 (too noisy to model), and writes the cleaned modeling table to data/panel_clean.csv. After the population filter and removal of any rows with missing predictor values, 3,069 counties survive into the modeling sample.
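
The join-and-filter logic described above can be outlined as follows. This is a Python sketch under stated assumptions, not a port of 01_data_pull.Rmd: the row layout, the field name `fips`, and the function names are hypothetical, and the rows are made up.

```python
def normalize_fips(code):
    """Zero-pad a county code to the standard 5-digit FIPS string."""
    return str(code).strip().zfill(5)


def build_modeling_table(bds_rows, acs_rows, min_population=1000):
    """Inner-join BDS and ACS rows on FIPS, compute the rate, apply the filters."""
    acs_by_fips = {normalize_fips(r["fips"]): r for r in acs_rows}
    table = []
    for bds in bds_rows:
        fips = normalize_fips(bds["fips"])
        acs = acs_by_fips.get(fips)
        if acs is None or bds["estabs_entry"] is None:
            continue  # county absent from one source, or target unavailable
        if any(value is None for value in acs.values()):
            continue  # missing predictor value
        population = acs["total_population"]
        if population < min_population:
            continue  # below the population floor
        row = dict(acs, fips=fips)
        row["startup_rate"] = bds["estabs_entry"] / population * 1000.0
        table.append(row)
    return table


# Made-up rows; the integer code 1001 is only there to show zero-padding.
bds_rows = [{"fips": 1001, "estabs_entry": 120},
            {"fips": "02002", "estabs_entry": 1},
            {"fips": "03003", "estabs_entry": 15}]
acs_rows = [{"fips": "01001", "total_population": 60000, "median_income": 55000},
            {"fips": "02002", "total_population": 640, "median_income": 48000},
            {"fips": "03003", "total_population": 5400, "median_income": None}]
table = build_modeling_table(bds_rows, acs_rows)
```

Keeping FIPS codes as zero-padded strings matters because a code read as a number loses its leading zero and then fails to match in the join.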

Reliability and limitations

The BDS and the ACS are widely used in academic and policy research and are the most authoritative federal sources for, respectively, business dynamics and county-level demographics. Both have known limitations relevant to this analysis:

  1. Cell-suppression in BDS. The BDS suppresses cells that contain too few businesses in order to protect business confidentiality. A handful of small-population counties drop out of the BDS file entirely, which biases our sample slightly toward larger counties.
  2. Sampling error in ACS. Because the ACS is a sample (not a census), each estimate carries a margin of error that is largest for small counties. I did not weight observations by, or filter on, ACS margin of error, which introduces some noise into predictors for the smallest counties in the sample.
  3. Single-year cross-section. Modeling the year 2023 alone means we are learning a snapshot rather than how counties trend over time. A county that is rapidly gaining or losing population will look identical to our model whether it is on a long-term boom or bust trajectory.
  4. No spatial structure. Counties next to each other are not statistically independent, but our models treat each county as an independent observation. This is acknowledged again in the limitations section at the end of the report.

Citations

U.S. Census Bureau. (2024a). Business Dynamics Statistics, county time series, 2023 release [Data set]. https://www.census.gov/data/datasets/time-series/econ/bds/bds-datasets.html

U.S. Census Bureau. (2024b). American Community Survey 5-Year Estimates (2019 to 2023), county level [Data set]. Retrieved via the Census Data API at https://api.census.gov/data/2023/acs/acs5

Exploratory Analysis

This section examines the modeling table from three angles: how the target itself is distributed, how each predictor is distributed, and how each predictor relates to the target. The 3,069 counties in the modeling sample give us enough data to read each pattern visually with confidence.

Distribution of the target

The startup-rate distribution is right-skewed. Most counties cluster between roughly 1 and 3 new establishments per 1,000 residents, with the modal county sitting near 1.5. A long thin tail extends past 10, accounting for the small number of counties with disproportionately high entrepreneurial activity. The mean (about 2.0) is pulled noticeably above the median (about 1.85) by that tail. Practically, this skew suggests that a relatively small fraction of counties drive a meaningful share of total business formation, and it raises a possible modeling decision (whether to log-transform the target) that we revisit in the modeling section.
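
To make the mean-versus-median reading and the possible log transform concrete, here is a small Python sketch on made-up rates. The function names are hypothetical, and using log(1 + rate) rather than a plain log is an assumption, motivated by the fact that the target includes zeros.

```python
import math
import statistics


def skew_summary(rates):
    """Mean, median, and their gap; a positive gap signals right skew."""
    mean = statistics.mean(rates)
    median = statistics.median(rates)
    return {"mean": mean, "median": median, "gap": mean - median}


def log_transform(rates):
    """log(1 + rate); log1p is used because the target includes zeros."""
    return [math.log1p(r) for r in rates]


# Made-up rates with one long-tail county, for illustration only.
toy_rates = [0.0, 1.2, 1.5, 1.8, 2.1, 2.6, 11.0]
before = skew_summary(toy_rates)
after = skew_summary(log_transform(toy_rates))
```

On the toy data the gap between mean and median shrinks sharply after the transform, which is the effect a log-transformed target would be expected to have on the real distribution.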

Predictor distributions

Three predictors are visibly right-skewed (median_income, pct_bachelors_or_higher, pct_foreign_born), which is consistent with the United States having a concentrated set of high-income, highly educated, and immigrant-rich metropolitan areas alongside a much larger mass of more typical counties. Three are roughly symmetric (lfp_rate, mean_commute, pct_25_44), suggesting that labor force participation, commute times, and prime-age population share vary across counties without the long thin tail that characterizes income or education. The skewness in income, education, and foreign-born share matters for modeling: linear methods can be sensitive to skewed predictors, while tree-based methods are largely robust to it.

How predictors relate to the target

The LOESS curves give us a first read on each relationship before any modeling assumptions are imposed. The strongest signal visible to the eye is a positive, near-linear trend with pct_bachelors_or_higher: counties at the high end of education show predicted startup rates roughly three times those at the low end. median_income shows a weaker but still positive trend across its full range. pct_foreign_born is mostly flat with a small upward lift at very high values (counties with large immigrant populations). mean_commute is slightly negative or flat, which is counterintuitive if we expected denser metropolitan areas to drive entrepreneurship; whether that effect survives once we control for income and education is a question for the multivariate models. The two most interesting shapes are lfp_rate and pct_25_44, both of which sit on a flat plateau through most of their range and then curve sharply upward only at the very high end; counties with unusually high labor-force participation or unusually large prime-age populations show meaningfully higher startup rates, while typical counties show no clear trend.

These are univariate views. The same predictor can look influential here and turn out redundant once we control for others, or the reverse. The nonlinear shapes in lfp_rate and pct_25_44 are also exactly the kind of pattern that a linear LASSO will smooth out but a Random Forest will pick up; the comparison between models in the next section is where this becomes visible.

Geographic pattern

Mapping the target onto U.S. counties reveals a regional pattern that is harder to see in the histograms. Higher startup rates concentrate visibly in the Sun Belt (Florida especially) and across scattered counties of the Mountain West, while the most extreme individual values often appear in sparsely populated rural counties where a small population denominator inflates the per-capita rate. The Midwestern farm belt and much of the Northeast show consistently lower rates. Geographic position is not in our predictor set, which means our models must explain this regional pattern entirely through the demographic and economic variables: population age structure, education, immigration, income, commuting, and labor force participation. What the predictors cannot capture is taken up in the limitations section.

Models and Interpretation

Design decisions

I fit three regression models on the same 80/20 train-test split, with all hyperparameter tuning performed by 10-fold cross-validation on the training portion only and final accuracy assessed on the held-out 20% test set. The three models were chosen to span three different modeling philosophies so that comparing them surfaces different aspects of the predictor-target relationship rather than just three flavors of the same idea:
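
The split-then-tune design can be sketched as below. This is an illustrative Python outline only: the seed, the function names, and the way folds are formed are assumptions, not the procedure used in the project.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=42):
    """Shuffle row indices once, then hold out the first test_fraction of them."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(train_indices, k=10):
    """Partition the training indices into k disjoint validation folds."""
    return [train_indices[i::k] for i in range(k)]


train_idx, test_idx = train_test_split_indices(3069)
folds = k_fold_indices(train_idx)
```

The key property is that the folds are built from the training indices alone, so the held-out 20% plays no part in choosing hyperparameters.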

  1. LASSO regression (linear, parametric, with shrinkage). Penalizes the sum of absolute coefficients, regularizing against overfitting and performing automatic variable selection. Coefficients are interpretable on the original scale, and the procedure gives a clear answer to “which predictors matter, holding the others fixed?”.
  2. Random Forest regression (tree-based ensemble). Grows many decision trees on bootstrap samples, randomly subsets the predictor pool at each split, and averages predictions. Captures nonlinearities and interactions that linear models miss, at the cost of interpretability.
  3. kNN regression (nonparametric, distance-based). Each prediction is the average startup rate of the \(k\) training counties closest to the query county in standardized predictor space. Makes no functional-form assumption.
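
Of the three models, kNN is simple enough to sketch in a few lines. The Python outline below is a from-scratch illustration on made-up data with two predictors, not the fitted model from this project; the function names are hypothetical.

```python
import math
import statistics


def standardize(rows):
    """Z-score each column using the training rows' mean and SD."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.pstdev(c) or 1.0 for c in columns]

    def scale(row):
        return [(x - m) / s for x, m, s in zip(row, means, sds)]

    return [scale(r) for r in rows], scale


def knn_predict(train_x, train_y, query, k):
    """Average the targets of the k training rows nearest to the query."""
    ranked = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Made-up training data: [median income, bachelor's share] and startup rates.
train_rows = [[40000, 15], [45000, 18], [50000, 20], [90000, 45], [95000, 50]]
train_y = [1.5, 1.7, 1.9, 3.6, 4.0]
train_z, scale = standardize(train_rows)
prediction = knn_predict(train_z, train_y, scale([92000, 47]), k=2)
```

Standardizing first matters because income in dollars would otherwise dominate the distance and make the percentage predictors irrelevant.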

All three were given the same 6 predictors and the same outcome. LASSO’s regularization parameter \(\lambda\), RF’s mtry, and kNN’s \(k\) were tuned by cross-validation, and the chosen values were \(\lambda = 0.013\), \(\text{mtry} = 2\), and \(k = 30\).
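
For reference, the LASSO objective in its standard form is given below, with \(y_i\) the startup rate of county \(i\), \(x_{ij}\) its \(j\)-th predictor, and \(n\) the number of training counties. The \(1/(2n)\) scaling of the squared-error term is an assumption (it is the convention in common implementations such as glmnet); the report does not state which scaling its \(\lambda\) refers to.

\[
\hat{\beta}(\lambda) = \arg\min_{\beta_0,\,\beta} \; \frac{1}{2n} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{6} x_{ij}\,\beta_j \Bigr)^{2} + \lambda \sum_{j=1}^{6} \lvert \beta_j \rvert
\]

Larger values of \(\lambda\) push more coefficients to exactly zero, which is the variable-selection behavior described above.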

Performance comparison

Test-set performance: 80/20 hold-out, 10-fold CV for tuning

Model            RMSE     Rsquared   MAE
LASSO            0.8964   0.2584     0.5322
Random Forest    0.8792   0.2827     0.5084
kNN              0.8780   0.2980     0.4922

All three models produced similar test-set accuracy. The kNN regressor edged out the other two on every metric (RMSE 0.878, R-squared 0.298, MAE 0.492); Random Forest came second by small margins, and LASSO trailed both. The differences between models, however, are smaller than the difference between any one of them and a simple baseline of “always predict the overall mean”: the standard deviation of the target is 0.95, so a constant-mean predictor would post RMSE 0.95 and R-squared 0. All three models therefore add real signal over an uninformative baseline.
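
The comparison with the constant-mean baseline can be made explicit with a short calculation. The Python sketch below uses only the RMSE values and the target SD quoted above; the function name is hypothetical.

```python
def rmse_skill(model_rmse, baseline_rmse):
    """Share of baseline squared error removed: 1 - (RMSE_model / RMSE_base)^2."""
    return 1.0 - (model_rmse / baseline_rmse) ** 2


baseline_rmse = 0.95  # SD of the target, as stated in the text
test_rmse = {"LASSO": 0.8964, "Random Forest": 0.8792, "kNN": 0.8780}
skill = {name: round(rmse_skill(r, baseline_rmse), 3) for name, r in test_rmse.items()}
print(skill)
```

On these inputs the error reduction works out to roughly 11 to 15 percent of baseline squared error, which is lower than the Rsquared column. One possible reason, offered here as an assumption rather than a confirmed detail of the analysis, is that Rsquared was computed as the squared correlation between predictions and observed values, as some toolkits do; another is that the 0.95 SD describes a different subset of counties than the test set.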

That the three modeling families converge to similar accuracy is itself a finding. It tells us that most of the signal in this dataset is already captured by linear contributions of the six predictors, with only modest additional gains from nonlinearity. Random Forest’s flexibility helps a little; kNN’s smooth local averaging helps a little more. But the bulk of predictive value sits in the strong, near-linear relationship between education and startup rate that even LASSO captures.

Typical error in context

The held-out test RMSE of approximately 0.88 means the models’ typical prediction error is about 0.88 new establishments per 1,000 residents. To translate that into a unit a county economic-development office would care about: a typical county in our dataset has roughly 60,000 residents, so an error of 0.88 per 1,000 residents corresponds to a miss of about 53 establishments, in either direction, in a typical year. For benchmarking, a subject-matter expert (e.g., an SBA grant analyst or a state-level economic-development planner) familiar with their region might predict county-level startup activity with a similar error band purely from local knowledge of major metros versus rural counties. The models are therefore not claimed to be meaningfully better than an informed expert at calling individual counties, but they offer consistent error properties across all 3,000-plus U.S. counties simultaneously, whereas a human expert’s intuition only generalizes to areas they know.
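
The conversion from a per-1,000 error to a count of establishments is a one-line calculation. A small Python sketch follows; the function name is hypothetical, and the populations other than 60,000 are made up to show how the same RMSE scales.

```python
def rate_error_to_establishments(rmse_per_1000, population):
    """Convert an error in rate units into a number of establishments."""
    return rmse_per_1000 * population / 1000.0


typical = rate_error_to_establishments(0.88, 60_000)    # the county size used in the text
small = rate_error_to_establishments(0.88, 5_000)       # illustrative small county
large = rate_error_to_establishments(0.88, 1_000_000)   # illustrative large county
```

The same rate error means only a handful of establishments in a small county and several hundred in a large one, which is worth keeping in mind when reading predictions for counties far from the typical size.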

For policy use cases, this error magnitude is enough to rank-order counties (the top 10% of predicted counties versus the bottom 10% are well-separated), but not enough to make fine distinctions among adjacent counties (a county predicted at 1.8 and another at 2.0 are statistically indistinguishable given our typical error). That asymmetry matters: the models are useful for triage (which counties merit closer attention) but not for individual scoring of any one county.

Variable importance and effect shapes

Across the three models, pct_bachelors_or_higher is the single most important predictor. Random Forest’s variable-importance plot put it at roughly 100 (normalized), with the second-place predictor at about 35. LASSO assigned it the largest IQR-weighted impact among the six predictors. kNN’s 25th-to-75th-percentile impact also placed it first. The relationship is consistently positive and roughly linear over most of its range, with a saturation effect above 60% bachelor’s-or-higher (Random Forest’s partial-dependence curve flattens above this threshold, suggesting that beyond a certain education density, additional college graduates do not translate into proportionally more new firms).
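
One common reading of an "IQR-weighted impact" is the coefficient multiplied by the predictor's interquartile range, that is, the predicted change when a county moves from the 25th to the 75th percentile. The Python sketch below illustrates that reading; whether it matches the exact calculation used in the workbook is an assumption, and the coefficient and data are made up.

```python
import statistics


def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def iqr_weighted_impact(coefficient, values):
    """Predicted change in the target for a 25th-to-75th percentile move."""
    return coefficient * iqr(values)


# Made-up education shares (percent) and a hypothetical linear coefficient.
toy_shares = [10, 15, 20, 25, 30, 35, 40]
impact = iqr_weighted_impact(0.04, toy_shares)
```

Weighting by the IQR puts predictors measured in dollars, minutes, and percentage points on a comparable footing, which a raw coefficient comparison would not.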

Three nonlinearities surface only when we leave LASSO’s linearity assumption behind:

  • mean_commute has a negative relationship with startup rate. RF’s partial-dependence plot shows a sharp drop from about 3.3 at very short commute times (5 to 10 minutes) down to about 2.0 by the 25-minute mark, then flat. This contradicted my prior expectation that denser metros should foster more entrepreneurship. The likely explanation: short-commute counties tend to be small-town and rural (where everyone works close to home), and these are precisely the counties whose per-capita rates are noisily inflated by small population denominators. With more years of data this effect would likely attenuate.
  • pct_25_44 shows a U-shape: counties with low or high shares of prime-age adults have the highest startup rates, with a dip in the middle. LASSO smoothed this away (linear coefficient -0.023, near zero); Random Forest preserved it. The U-shape is consistent with different kinds of entrepreneurial activity at the two extremes, for example very young counties (university towns) or very old or family-stable counties at the low end of the prime-age share, with mid-range suburban counties sitting in the dip.
  • lfp_rate has a J-shape (flat through most of its range, sharply upward only at the very high end). LASSO shrank this predictor to exactly zero (did not select it); Random Forest assigned it importance ~20 (top half). The high-LFP tail likely captures a small set of high-employment, high-engagement counties that are statistically distinct from the rest.

These nonlinearities are exactly the part of the predictor-outcome relationship that linear methods cannot capture, and they are visible in Sections 5 and 6 (Random Forest partial-dependence and kNN marginal-effect plots) of the accompanying analysis workbook.
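
For readers unfamiliar with partial-dependence curves, the idea can be sketched in a few lines of Python. The `toy_predict` function is a hypothetical stand-in with an invented J-shape in lfp_rate, not the fitted Random Forest, and the rows are made up.

```python
def partial_dependence(predict, rows, feature, grid):
    """For each grid value, set `feature` to it in every row and average the predictions."""
    curve = []
    for value in grid:
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append((value, sum(preds) / len(preds)))
    return curve


# Hypothetical stand-in model with a J-shape in lfp_rate (NOT the fitted forest).
def toy_predict(row):
    bump = max(0.0, row["lfp_rate"] - 70.0) * 0.1
    return 1.5 + 0.02 * row["pct_bachelors_or_higher"] + bump


toy_rows = [
    {"lfp_rate": 55.0, "pct_bachelors_or_higher": 15.0},
    {"lfp_rate": 62.0, "pct_bachelors_or_higher": 25.0},
    {"lfp_rate": 74.0, "pct_bachelors_or_higher": 40.0},
]
curve = partial_dependence(toy_predict, toy_rows, "lfp_rate", [50.0, 60.0, 70.0, 80.0])
```

The resulting curve is flat up to 70 and rises afterwards, mirroring the kind of J-shape described above while every other predictor keeps its observed values.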

Predictions for selected counties

To make the modeling concrete, four counties were selected to span the range of profiles in the data (a programmatically chosen median-demographics county, plus three named urban, rural, and tech-metro counties). Predictions were generated from all three fitted models:

Predicted vs actual startup rate (per 1,000) for selected counties

County                                    Actual   LASSO   RF     kNN
Morgan County, AL (median demographic)    1.62     2.01    1.76   1.85
Cook County, IL (Chicago metro)           2.31     2.45    2.57   2.66
Stark County, IL (rural)                  2.82     1.82    1.58   1.62
Travis County, TX (Austin)                3.99     2.99    3.84   3.41

Three observations from this table. First, all three models agree closely for typical counties (Morgan, Cook). The “three modeling philosophies” disagreement is concentrated in the tails. Second, Random Forest substantially outperforms LASSO and kNN on Travis County (Austin), the highest-rate county of the four (RF predicts 3.84 against actual 3.99, an error of just 0.15; LASSO predicts 2.99, a full 1.0 below actual). This is exactly the high-end nonlinearity scenario where a tree-based model’s flexibility pays off. Third, all three models substantially under-predict Stark County’s actual rate of 2.82 (predictions range 1.58 to 1.82). This is the small-population-denominator artifact: Stark County’s high per-capita rate reflects a small handful of new establishments in a small county, and no demographics-only model can be expected to predict that.
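
The per-county errors behind these observations can be reproduced directly from the table. The Python sketch below uses only the table's values; the variable names are hypothetical.

```python
actual = {"Morgan": 1.62, "Cook": 2.31, "Stark": 2.82, "Travis": 3.99}
predicted = {
    "LASSO": {"Morgan": 2.01, "Cook": 2.45, "Stark": 1.82, "Travis": 2.99},
    "RF": {"Morgan": 1.76, "Cook": 2.57, "Stark": 1.58, "Travis": 3.84},
    "kNN": {"Morgan": 1.85, "Cook": 2.66, "Stark": 1.62, "Travis": 3.41},
}

# Absolute error per model and county, rounded to the table's precision.
abs_error = {
    model: {county: round(abs(preds[county] - actual[county]), 2) for county in actual}
    for model, preds in predicted.items()
}

# Model with the smallest absolute error for each county.
best_model = {
    county: min(predicted, key=lambda m: abs(predicted[m][county] - actual[county]))
    for county in actual
}
```

Every model over-predicts the two lower-rate counties (Morgan, Cook) and under-predicts the two higher-rate ones (Stark, Travis), a pattern of predictions pulled toward the middle of the range.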

In terms of expected accuracy for any given county, the test-set RMSE of 0.88 implies that any single prediction has roughly a 1-standard-deviation band of \(\pm 0.88\) around it. A user querying the model for, say, Cook County (predicted 2.5 to 2.7 by the three models) should treat the answer as “somewhere between about 1.6 and 3.5 startups per 1,000” rather than as a point estimate.

Limitations and Future Work

The analysis above produces a usable but imperfect picture of county-level startup formation. Several limitations are worth naming.

Spatial autocorrelation. Counties next to each other share labor markets, supply chains, and customer bases. Every model in this project treats each county as statistically independent, which is a simplifying assumption: it ignores the information that neighboring counties carry about one another. A spatially explicit model (geographically weighted regression, or an ICAR random effect at the state or commuting-zone level) would be the principled fix and would likely improve test-set accuracy.

Single-year cross-section. Modeling the year 2023 alone gives us a snapshot, not a trajectory. A county on a long-term boom looks identical to a county at the peak of a one-year spike. Pooling several years of BDS and ACS into a panel (2018 through 2023, for example) would let us model county fixed effects, and would dampen the small-population-county noise that hurt the Stark County prediction.

Heavy-tail compression. All three models in this project compress predictions toward the mean. The Travis County example shows the closest any model came (Random Forest at 3.84 against actual 3.99), but the dataset’s rare extreme counties are inherently hard to predict from a six-predictor demographic snapshot. The handful of counties in the dataset with startup rates above 8 per 1,000 simply do not have enough company in predictor space to be learned reliably.

Predictor scope. Six predictors is conservative. Plausible additions, all available from primary public sources, include SBA 7(a) loan volume by county (a direct measure of access to capital), FCC broadband download speed by census tract (more granular than ACS adoption rates), state-level tax climate, and a categorical state or region effect. Each could plug into the same modeling pipeline.

Modeling alternatives. A generalized additive model (GAM) with smoothers on each predictor would produce smoother, more interpretable nonlinearities than Random Forest while still capturing the J-shape and U-shape relationships that LASSO smoothed away. Boosted trees (gbm) would likely improve raw RMSE further at the cost of more tuning. Both extensions would be straightforward to add to the existing workbook.

Interpretation caveat. The strongest single relationship in the data is between bachelor’s-degree share and startup rate, but this should not be read as causal. Counties with high education shares also have higher incomes, more young adults, more foreign-born residents, and denser metropolitan structures; the model can identify the correlate but not isolate which specific channel drives the effect. A causal estimate would require a research design (instrumental variables, difference-in-differences) outside the scope of this project.