This project asks a single question: what county-level demographic and economic characteristics best predict the rate of new business formation across the United States, and how accurately can we predict a county’s startup activity from its profile alone?
Small business formation is widely treated as a leading indicator of regional economic vitality. Counties that consistently generate new firms tend to see stronger long-term job creation, wage growth, and economic mobility (Glaeser, Kerr, Carlino). For policymakers, predicting where new businesses are likely to form matters for several practical reasons. A federal agency such as the Small Business Administration (SBA) can target technical-assistance grants toward counties that look poised to grow. A state economic-development office can identify counties under-performing relative to their demographic profile and prioritize them for workforce programs. A journalist or academic researcher can ask, “after controlling for income and education, do places with more foreign-born residents really see more new businesses?” The model built in this project supports all three of those uses.
The target is the county-level firm startup rate,
defined as new establishments per 1,000 residents in 2023. It is
computed from the U.S. Census Bureau’s Business Dynamics Statistics
(BDS) county-level release as
estabs_entry / total_population × 1,000. After cleaning,
the modeling sample contains 3,069 U.S. counties; the rate ranges from 0
to 15 with a mean of approximately 2 startups per 1,000 residents and a
roughly right-skewed distribution.
Six predictors, all drawn from the U.S. Census American Community Survey (ACS) 5-Year Estimates, 2019–2023, were chosen because each represents a hypothesized channel through which county-level demographics might shape entrepreneurial activity:
- median_income: median household income (ACS table B19013)
- pct_bachelors_or_higher: share of adults aged 25 and over with a bachelor’s degree or higher (B15003)
- pct_foreign_born: foreign-born share of the population (B05002)
- lfp_rate: labor force participation rate for the population aged 16 and over (B23025)
- mean_commute: mean travel time to work (B08013 with B08303)
- pct_25_44: share of the population aged 25 to 44 (B01001)
These six variables collectively span economic capacity, human capital, demographic composition, and structural dynamism. Their relative importance and any nonlinear interactions are what the modeling work in the following sections is designed to surface.
Given a county’s six predictor values, each fitted model returns a predicted startup rate. The intended use is cross-sectional inference: the model never sees BDS data at prediction time. It only sees ACS demographics, then estimates how entrepreneurially active that county is likely to be. A user can:
The modeling table for this project was constructed from two primary federal sources, both pulled directly from the U.S. Census Bureau and joined on the standard 5-digit FIPS county code.
The target variable, county-level firm formation, comes from the
Business Dynamics Statistics (BDS) county time series, 2023
release (U.S. Census Bureau, 2024a). The BDS is constructed
from the Longitudinal Business Database, which links Census Bureau
employment records over time to track every U.S. private-sector employer
establishment from its first appearance through any name changes,
ownership transfers, and eventual exit. The estabs_entry
field used here counts establishments that were not present in the prior
year and have positive employment in the current year. Data are released
annually with roughly a two-year lag: the 2023 release was published in
late 2024 and contains data through fiscal year 2023. I downloaded the
county time-series CSV directly from
www2.census.gov/programs-surveys/bds/tables/time-series/2023/bds2023_st_cty.csv.
The six predictor variables come from the American Community
Survey (ACS) 5-Year Estimates, 2019 to 2023, county level (U.S.
Census Bureau, 2024b). The ACS is a continuous monthly household survey
that samples about 3.5 million addresses per year, with the 5-year file
pooling sixty months of responses to deliver county-level estimates for
even the smallest counties. Five-year estimates are the only ACS product
available for all 3,000-plus U.S. counties; the more current 1-year
estimates only cover counties with population above 65,000. The
variables I used are derived from standard B-tables: B19013 (median
household income), B15003 (educational attainment for the population
aged 25 and over), B05002 (place of birth by nativity), B23025
(employment status for the population aged 16 and over), B08013 with
B08303 (aggregate travel time to work and the count of commuting
workers, whose ratio gives the mean commute), and B01001 (sex by age,
used to derive the share of population
aged 25 to 44). All ACS variables were pulled via the Census Data API at
api.census.gov/data/2023/acs/acs5.
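A request against that endpoint can be sketched as follows. This is a minimal Python illustration, not the code in the companion R Markdown file: the helper names are hypothetical, and the variable codes (`B19013_001E`, `B08013_001E`, `B08303_001E`) assume the usual ACS convention that `_001E` is the estimate on a table's first line. The sketch only builds the query URL and performs no network call.

```python
from urllib.parse import urlencode

ACS_ENDPOINT = "https://api.census.gov/data/2023/acs/acs5"


def build_acs_url(variables, api_key=None):
    """Return a query URL asking for the given variables for every county."""
    params = {"get": ",".join(["NAME", *variables]), "for": "county:*"}
    if api_key:
        params["key"] = api_key  # optional; assumed to be supplied by the user
    # Keep ',', ':' and '*' unescaped so the URL stays readable.
    return ACS_ENDPOINT + "?" + urlencode(params, safe=",:*")


def mean_commute_minutes(aggregate_minutes, commuters):
    """Mean commute = aggregate travel time (B08013) / commuter count (B08303)."""
    if not commuters:
        return None  # avoid dividing by zero or by a missing count
    return aggregate_minutes / commuters


url = build_acs_url(["B19013_001E", "B08013_001E", "B08303_001E"])
print(url)
```

The division in `mean_commute_minutes` is the reason two tables are needed for one predictor: neither table carries the mean itself.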
The end-to-end pipeline lives in the companion
01_data_pull.Rmd. It downloads the BDS county CSV directly
from www2.census.gov, calls the Census Data API for the ACS
variables, joins the two sources on FIPS code, computes the target rate
(estabs_entry / total_population × 1,000) along with four
derived percentage predictors, filters out counties with population
below 1,000 (too noisy to model), and writes the cleaned modeling table
to data/panel_clean.csv. After the population filter and
removal of any rows with missing predictor values, 3,069
counties survive into the modeling sample.
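The join, rate calculation, and filters described above can be sketched in a few lines. This is a hedged Python illustration of the logic only (the actual pipeline is the R Markdown file named above). The function name, the record layout, and the toy values are assumptions made for the example; the FIPS codes are placeholders, not real counties.

```python
def build_modeling_rows(bds_rows, acs_rows, min_population=1000):
    """Inner-join BDS and ACS records on FIPS, then compute the startup rate."""
    acs_by_fips = {row["fips"]: row for row in acs_rows}
    modeling_rows = []
    for bds in bds_rows:
        acs = acs_by_fips.get(bds["fips"])
        if acs is None:
            continue  # county missing from ACS: dropped by the inner join
        if any(value is None for value in acs.values()):
            continue  # drop rows with a missing predictor or population
        population = acs["total_population"]
        if population < min_population:
            continue  # population filter (too noisy to model)
        rate = bds["estabs_entry"] / population * 1000
        modeling_rows.append({**acs, "startup_rate": rate})
    return modeling_rows


# Toy records. FIPS codes are kept as zero-padded strings so that
# leading zeros survive the join.
bds = [
    {"fips": "00001", "estabs_entry": 120},
    {"fips": "00002", "estabs_entry": 3},
    {"fips": "00003", "estabs_entry": 5},
]
acs = [
    {"fips": "00001", "total_population": 60000, "median_income": 65000},
    {"fips": "00002", "total_population": 800, "median_income": 50000},
]
rows = build_modeling_rows(bds, acs)
print(rows)
```

Only the first toy county survives: the second falls below the population threshold and the third has no ACS match.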
The BDS and the ACS are widely used in academic and policy research and are the most authoritative federal sources for, respectively, business dynamics and county-level demographics. Both have known limitations relevant to this analysis:
U.S. Census Bureau. (2024a). Business Dynamics Statistics, county time series, 2023 release [Data set]. https://www.census.gov/data/datasets/time-series/econ/bds/bds-datasets.html
U.S. Census Bureau. (2024b). American Community Survey 5-Year Estimates (2019 to 2023), county level [Data set]. Retrieved via the Census Data API at https://api.census.gov/data/2023/acs/acs5
This section examines the modeling table from three angles: how the target itself is distributed, how each predictor is distributed, and how each predictor relates to the target. The 3,069 counties in the modeling sample give us enough data to read each pattern visually with confidence.
The startup-rate distribution is right-skewed. Most counties cluster between roughly 1 and 3 new establishments per 1,000 residents, with the modal county sitting near 1.5. A long thin tail extends past 10, accounting for the small number of counties with disproportionately high entrepreneurial activity. The mean (about 2.0) is pulled noticeably above the median (about 1.85) by that tail. Practically, this skew suggests that a relatively small fraction of counties drive a meaningful share of total business formation, and it raises a possible modeling decision (whether to log-transform the target) that we revisit in the modeling section.
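To make the log-transform option concrete, here is a small Python sketch on invented rates (not the project data). It uses `log1p`, i.e. log(1 + rate), on the assumption that a plain logarithm is unsuitable because the observed rate can be exactly 0. Whether the transform helps the actual models is not something this sketch can show.

```python
import math
import statistics

# Invented right-skewed rates per 1,000 residents, for illustration only.
toy_rates = [0.0, 1.2, 1.5, 1.6, 1.9, 2.1, 2.4, 3.0, 11.0]

# On the raw scale the long tail pulls the mean above the median.
raw_gap = statistics.mean(toy_rates) - statistics.median(toy_rates)

# log1p compresses the tail and is defined at zero.
log_rates = [math.log1p(rate) for rate in toy_rates]
log_gap = statistics.mean(log_rates) - statistics.median(log_rates)


def back_transform(log_prediction):
    """Map a prediction made on the log1p scale back to startups per 1,000."""
    return math.expm1(log_prediction)


print(round(raw_gap, 3), round(log_gap, 3))
```

A model fit on the transformed target predicts on the log scale, so any reported error would need `back_transform` applied before comparison with the per-1,000 figures used elsewhere in the report.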
Three predictors are visibly right-skewed
(median_income, pct_bachelors_or_higher,
pct_foreign_born), which is consistent with the United
States having a concentrated set of high-income, highly educated, and
immigrant-rich metropolitan areas alongside a much larger mass of more
typical counties. Three are roughly symmetric (lfp_rate,
mean_commute, pct_25_44), suggesting that
labor force participation, commute times, and prime-age population share
vary across counties without the long thin tail that characterizes
income or education. The skewness in income, education, and foreign-born
share matters for modeling: linear methods can be sensitive to skewed
predictors, while tree-based methods are largely robust to it.
The LOESS curves give us a first read on each relationship before any
modeling assumptions are imposed. The strongest signal visible to the
eye is a positive, near-linear trend with
pct_bachelors_or_higher: counties at the high end of
education show predicted startup rates roughly three times those at the
low end. median_income shows a weaker but still positive
trend across its full range. pct_foreign_born is mostly
flat with a small upward lift at very high values (counties with large
immigrant populations). mean_commute is slightly negative
or flat, which is counterintuitive if we expected denser metropolitan
areas to drive entrepreneurship; whether that effect survives once we
control for income and education is a question for the multivariate
models. The two most interesting shapes are lfp_rate and
pct_25_44, both of which sit on a flat plateau through most
of their range and then curve sharply upward only at the very high end;
counties with unusually high labor-force participation or unusually
large prime-age populations show meaningfully higher startup rates,
while typical counties show no clear trend.
These are univariate views. The same predictor can look influential
here and turn out redundant once we control for others, or the reverse.
The nonlinear shapes in lfp_rate and pct_25_44
are also exactly the kind of pattern that a linear LASSO will smooth out
but a Random Forest will pick up; the comparison between models in the
next section is where this becomes visible.
Mapping the target onto U.S. counties reveals a regional pattern that is harder to see in the histograms. Higher startup rates concentrate visibly in the Sun Belt (Florida especially) and across scattered counties of the Mountain West, while the most extreme individual values often appear in sparsely populated rural counties where a small population denominator inflates the per-capita rate. The Midwestern farm belt and much of the Northeast show consistently lower rates. Geographic position is not in our predictor set, which means our models must explain this regional pattern entirely through the demographic and economic variables: population age structure, education, immigration, income, commuting, and labor force participation. What the predictors cannot capture is taken up in the limitations section.
I fit three regression models on the same 80/20 train-test split, with all hyperparameter tuning performed by 10-fold cross-validation on the training portion only and final accuracy assessed on the held-out 20% test set. The three models were chosen to span three different modeling philosophies so that comparing them surfaces different aspects of the predictor-target relationship rather than just three flavors of the same idea:
All three were given the same 6 predictors and the same outcome.
LASSO’s regularization parameter \(\lambda\), RF’s mtry, and
kNN’s \(k\) were tuned by
cross-validation, and the chosen values were \(\lambda = 0.013\), \(\text{mtry} = 2\), and \(k = 30\).
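The resampling scheme (one 80/20 split, then 10-fold cross-validation inside the training rows only) can be sketched as below. This is a hedged Python illustration, not the project's code: the function names and the seed are hypothetical, and the sketch only produces row indices rather than fitting any model.

```python
import random


def train_test_split_indices(n_rows, test_fraction=0.2, seed=2023):
    """Shuffle row indices once and hold out a fixed test fraction."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_splits(train_indices, k=10):
    """Yield (fitting, validation) index lists drawn from training rows only."""
    folds = [train_indices[i::k] for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        fitting = [idx for f, fold in enumerate(folds) if f != held_out
                   for idx in fold]
        yield fitting, validation


train_idx, test_idx = train_test_split_indices(3069)
splits = list(k_fold_splits(train_idx, k=10))
print(len(train_idx), len(test_idx), len(splits))
```

Each candidate value of a hyperparameter would be scored by its average validation error across the ten splits; the test indices play no part until the final accuracy check.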
| Model | RMSE | R-squared | MAE |
|---|---|---|---|
| LASSO | 0.8964 | 0.2584 | 0.5322 |
| Random Forest | 0.8792 | 0.2827 | 0.5084 |
| kNN | 0.8780 | 0.2980 | 0.4922 |
All three models produced similar test-set accuracy. The kNN regressor edged out the other two on every metric (RMSE 0.878, R-squared 0.298, MAE 0.492); Random Forest came second by small margins, and LASSO trailed both. The differences between models, however, are smaller than the difference between any one of them and a simple baseline of “always predict the overall mean”: the standard deviation of the target is 0.95, so a constant-mean predictor would post RMSE 0.95 and R-squared 0. All three models therefore add real signal over an uninformative baseline.
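The baseline comparison can be checked with a few lines of Python. The sketch below defines the three metrics in their textbook forms and confirms on toy numbers that an "always predict the mean" model has an RMSE equal to the target's standard deviation and an R-squared of 0. The function names and toy values are illustrative assumptions; the report does not state which R-squared definition its table uses.

```python
import math
import statistics


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r2_vs_mean(actual, predicted):
    """R-squared as 1 - SSE/SST, i.e. improvement over a constant-mean model."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - sse / sst


toy_actual = [1.0, 2.0, 3.0, 6.0]
baseline = [statistics.mean(toy_actual)] * len(toy_actual)
baseline_rmse = rmse(toy_actual, baseline)      # equals the population SD
baseline_r2 = r2_vs_mean(toy_actual, baseline)  # exactly 0

# R-squared implied by the reported kNN RMSE and the quoted target SD.
implied_r2 = 1 - (0.878 / 0.95) ** 2
print(round(implied_r2, 3))
```

Under this definition the reported figures imply an R-squared of roughly 0.15 for kNN, lower than the 0.298 in the table. Two possible explanations, both assumptions rather than facts from the report: the table may report the squared correlation between predictions and actual values, or the quoted standard deviation of 0.95 may describe the full sample rather than the test set.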
That the three modeling families converge to similar accuracy is itself a finding. It tells us that most of the signal in this dataset is already captured by linear contributions of the six predictors, with only modest additional gains from nonlinearity. Random Forest’s flexibility helps a little; kNN’s smooth local averaging helps a little more. But the bulk of predictive value sits in the strong, near-linear relationship between education and startup rate that even LASSO captures.
The held-out test RMSE of approximately 0.88 means the models’ typical prediction error is about 0.88 new establishments per 1,000 residents. To translate that into a unit a county economic-development office would care about: a typical county in our dataset has roughly 60,000 residents, so an error of 0.88 per 1,000 residents corresponds to a miss of about 53 establishments, in either direction, in a typical year. For benchmarking, a subject-matter expert (e.g., an SBA grant analyst or a state-level economic-development planner) familiar with their region might plausibly predict county-level startup activity with a similar error band purely from local knowledge of major metros versus rural counties. The models are therefore not meaningfully better than an informed expert at calling individual counties, but they offer consistent error properties across all of the roughly 3,000 U.S. counties at once, whereas a human expert’s intuition generalizes only to areas they know.
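The unit conversion is simple enough to write down. A minimal Python sketch follows; the function name is hypothetical, and the two extra populations are invented purely to show how the same per-capita error scales with county size.

```python
def error_in_establishments(rmse_per_1000, population):
    """Convert a per-1,000-resident error into a count of establishments."""
    return rmse_per_1000 * population / 1000


typical_miss = error_in_establishments(0.88, 60_000)    # the report's typical county
small_miss = error_in_establishments(0.88, 5_000)       # hypothetical small county
large_miss = error_in_establishments(0.88, 1_000_000)   # hypothetical large county
print(round(typical_miss, 1), round(small_miss, 1), round(large_miss, 1))
```

The same 0.88 error is a handful of establishments in a small county and several hundred in a large one, so count-based planning should scale the error by each county's own population.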
For policy use cases, this error magnitude is enough to rank-order counties (the top 10% of predicted counties versus the bottom 10% are well-separated), but not enough to make fine distinctions among adjacent counties (a county predicted at 1.8 and another at 2.0 are statistically indistinguishable given our typical error). That asymmetry matters: the models are useful for triage (which counties merit closer attention) but not for individual scoring of any one county.
Across the three models, pct_bachelors_or_higher
is the single most important predictor. Random Forest’s
variable-importance plot put it at roughly 100 (normalized), with the
second-place predictor at about 35. LASSO assigned it the largest
IQR-weighted impact among the six predictors. kNN’s
25th-to-75th-percentile impact also placed it first. The relationship is
consistently positive and roughly linear over most of its range, with a
saturation effect above 60% bachelor’s-or-higher (Random Forest’s
partial-dependence curve flattens above this threshold, suggesting that
beyond a certain education density, additional college graduates do not
translate into proportionally more new firms).
Three nonlinearities surface only when we leave LASSO’s linearity assumption behind:
- mean_commute has a negative relationship with startup rate. RF’s partial-dependence plot shows a sharp drop from about 3.3 at very short commute times (5 to 10 minutes) down to about 2.0 by the 25-minute mark, then flat. This contradicted my prior expectation that denser metros should foster more entrepreneurship. The likely explanation: short-commute counties tend to be small-town and rural (where everyone works close to home), and these are precisely the counties whose per-capita rates are noisily inflated by small population denominators. With more years of data this effect would likely attenuate.
- pct_25_44 shows a U-shape: counties with low or high shares of prime-age adults have the highest startup rates, with a dip in the middle. LASSO smoothed this away (linear coefficient -0.023, near zero); Random Forest preserved it. The U-shape is consistent with two different kinds of entrepreneurial activity: counties with few prime-age adults, whether very young (university towns) or very old or family-stable, at one end, and prime-age suburban counties at the other.
- lfp_rate has a J-shape (flat through most of its range, sharply upward only at the very high end). LASSO shrank this predictor to exactly zero (did not select it); Random Forest assigned it an importance of about 20 (top half). The high-LFP tail likely captures a small set of high-employment, high-engagement counties that are statistically distinct from the rest.

These nonlinearities are exactly the part of the predictor-outcome relationship that linear methods cannot capture, and they are visible in Sections 5 and 6 (Random Forest partial-dependence and kNN marginal-effect plots) of the accompanying analysis workbook.
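For readers unfamiliar with partial-dependence curves, the idea can be sketched in a few lines of Python. The function below forces one predictor to each value on a grid, leaves the other predictors at their observed values, and averages the model's predictions. The `toy_predict` function is an invented stand-in with a built-in J-shape in lfp_rate; it is not the fitted Random Forest, and the rows are not project data.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        predictions = [predict({**row, feature: value}) for row in rows]
        curve.append(sum(predictions) / len(predictions))
    return curve


def toy_predict(row):
    """Stand-in model: linear in education, flat then rising in lfp_rate."""
    base = 1.5 + 0.02 * row["pct_bachelors_or_higher"]
    bump = 0.1 * max(0.0, row["lfp_rate"] - 70.0)
    return base + bump


toy_rows = [
    {"pct_bachelors_or_higher": 20.0, "lfp_rate": 55.0},
    {"pct_bachelors_or_higher": 40.0, "lfp_rate": 65.0},
]
curve = partial_dependence(toy_predict, toy_rows, "lfp_rate", [50.0, 60.0, 70.0, 80.0])
print([round(value, 2) for value in curve])
```

The resulting curve is flat across the first three grid points and rises only at the last, which is the J-shape pattern described for lfp_rate. A linear model applied to the same grid could only return a straight line.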
To make the modeling concrete, four counties were selected to span the range of profiles in the data (a programmatically chosen median-demographics county, plus three named urban, rural, and tech-metro counties). Predictions were generated from all three fitted models:
| County | Actual | LASSO | RF | kNN |
|---|---|---|---|---|
| Morgan County, AL (median demographic) | 1.62 | 2.01 | 1.76 | 1.85 |
| Cook County, IL (Chicago metro) | 2.31 | 2.45 | 2.57 | 2.66 |
| Stark County, IL (rural) | 2.82 | 1.82 | 1.58 | 1.62 |
| Travis County, TX (Austin) | 3.99 | 2.99 | 3.84 | 3.41 |
Three observations follow from this table. First, all three models agree closely for typical counties (Morgan, Cook). The “three modeling philosophies” disagreement is concentrated in the tails. Second, Random Forest substantially outperforms LASSO and kNN on Travis County (Austin), the highest-rate county among the four examples (RF predicts 3.84 against actual 3.99, an error of just 0.15; LASSO predicts 2.99, a full 1.0 below actual). This is exactly the high-end nonlinearity scenario where a tree-based model’s flexibility pays off. Third, all three models substantially under-predict Stark County’s actual rate of 2.82 (predictions range from 1.58 to 1.82). This is the small-population-denominator artifact: Stark County’s high per-capita rate reflects a small handful of new establishments in a small county, and no demographics-only model can be expected to predict that.
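The per-county errors behind these observations can be recomputed directly from the table. The Python sketch below uses only the numbers printed above; the variable names are illustrative.

```python
# Actual and predicted startup rates, copied from the table above.
counties = {
    "Morgan County, AL": {"actual": 1.62, "LASSO": 2.01, "RF": 1.76, "kNN": 1.85},
    "Cook County, IL": {"actual": 2.31, "LASSO": 2.45, "RF": 2.57, "kNN": 2.66},
    "Stark County, IL": {"actual": 2.82, "LASSO": 1.82, "RF": 1.58, "kNN": 1.62},
    "Travis County, TX": {"actual": 3.99, "LASSO": 2.99, "RF": 3.84, "kNN": 3.41},
}
models = ("LASSO", "RF", "kNN")

# Absolute error of each model for each county, rounded to table precision.
abs_errors = {
    name: {m: round(abs(values[m] - values["actual"]), 2) for m in models}
    for name, values in counties.items()
}

# Model with the smallest absolute error in each county.
best_model = {name: min(errs, key=errs.get) for name, errs in abs_errors.items()}

for name, errs in abs_errors.items():
    print(name, errs, "best:", best_model[name])
```

On these four rows no single model has the smallest error everywhere, and every model misses Stark County by at least 1.0, which fits the reading that the Stark County miss comes from the data rather than from any one modeling choice.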
In terms of expected accuracy for any given county, the test-set RMSE of 0.88 implies that any single prediction carries an error band of roughly \(\pm 0.88\) (about one standard deviation of the prediction error). A user querying the model for, say, Cook County (predicted 2.5 to 2.7 by the three models) should treat the answer as a range of roughly 1.7 to 3.4 startups per 1,000 rather than as a point estimate.
The analysis above produces a usable but imperfect picture of county-level startup formation. Several limitations are worth naming.
Spatial autocorrelation. Counties next to each other share labor markets, supply chains, and customer bases. Every model in this project treats each county as statistically independent, which is a simplification: it ignores the information that neighboring counties carry about one another. A spatial model (geographically weighted regression, or an ICAR random effect at the state or commuting-zone level) would be the principled fix and would likely improve test-set accuracy.
Single-year cross-section. Modeling the year 2023 alone gives us a snapshot, not a trajectory. A county on a long-term boom looks identical to a county at the peak of a one-year spike. Pooling several years of BDS and ACS into a panel (2018 through 2023, for example) would let us model county fixed effects, and would dampen the small-population-county noise that hurt the Stark County prediction.
Heavy-tail compression. All three models in this project compress predictions toward the mean. The Travis County example shows the closest any model came (Random Forest at 3.84 against actual 3.99), but the dataset’s rare extreme counties are inherently hard to predict from a six-predictor demographic snapshot. The handful of counties in the dataset with startup rates above 8 per 1,000 simply do not have enough company in predictor space to be learned reliably.
Predictor scope. Six predictors is conservative. Plausible additions, all available from primary public sources, include SBA 7(a) loan volume by county (a direct measure of access to capital), FCC broadband download speed by census tract (more granular than ACS adoption rates), state-level tax climate, and a categorical state or region effect. Each could plug into the same modeling pipeline.
Modeling alternatives. A generalized additive model
(GAM) with smoothers on each predictor would produce smoother, more
interpretable nonlinearities than Random Forest while still capturing
the J-shape and U-shape relationships that LASSO smoothed away. Boosted
trees (gbm) would likely improve raw RMSE further at the
cost of more tuning. Both extensions would be straightforward to add to
the existing workbook.
Interpretation caveat. The strongest single relationship in the data is between bachelor’s-degree share and startup rate, but this should not be read as causal. Counties with high education shares also have higher incomes, more young adults, more foreign-born residents, and denser metropolitan structures; the model can identify the correlate but not isolate which specific channel drives the effect. A causal estimate would require a research design (instrumental variables, difference-in-differences) outside the scope of this project.