Survey Data Analysis with Samplics
Professional survey data analysis combining proper sampling design estimation with statistical modeling.
Complex survey data with sampling weights, strata, and/or clusters
Descriptive statistics (means, medians, proportions, totals) with proper standard errors
Tabulation and cross-tabulation by demographics, time periods, or multiple factors
Categorical data analysis with weighted counts and proportions
Time series analysis of survey data (trends over time)
Subgroup analysis with proper variance estimation
Regression modeling that accounts for survey design
Publication-ready tables and visualizations
Required packages:
pip install samplics pandas numpy matplotlib --break-system-packages
pip install statsmodels --break-system-packages
Python version: 3.8+
Data requirements:
Survey weight variable (required)
Stratum variable (optional, for stratified designs)
Cluster/PSU variable (optional, for clustered designs)
References (references/)
Load these when implementing survey analyses or needing detailed guidance:
descriptive-statistics.md - Comprehensive guide to means, medians, proportions, totals
tabulation-guide.md - Cross-tabulation and categorical data analysis patterns
regression-integration.md - Integrating samplics with statsmodels for weighted regression
design-effects.md - Understanding and computing design effects (DEFF, DEFT)
Scripts (scripts/)
Executable utilities for common survey analysis tasks:
descriptive_pipeline.py - Complete descriptive analysis workflow
tabulation_pipeline.py - Automated tabulation and cross-tabulation
trend_analysis.py - Time series analysis of survey data
Assets (assets/)
Production-ready analysis templates and examples:
nhanes_example.py - Complete analysis using simulated NHANES-style data
demographic_analysis.py - Subgroup analysis template
trend_visualization.py - Time series plotting patterns
pip install samplics pandas --break -system-packages
import pandas as pd
from samplics.estimation import TaylorEstimator
df = pd.read_csv("survey_data.csv" )
estimator = TaylorEstimator("mean" )
estimate = estimator.estimate(
y=df["variable" ].values,
samp_weight=df["weight" ].values,
stratum=df["stratum" ].values,
psu=df["cluster" ].values,
remove_nan=True
)
print (f"Estimate: {estimate.point_est} " )
print (f"SE: {estimate.stderror} " )
print (f"95% CI: [{estimate.lower_ci} , {estimate.upper_ci} ]" )
Example 1: Simple Mean with Standard Error
import pandas as pd
import numpy as np
from samplics.estimation import TaylorEstimator
df = pd.DataFrame({
'income' : np.random.lognormal(10 , 1 , 1000 ),
'weight' : np.random.uniform(0.5 , 2 , 1000 ),
'stratum' : np.random.choice(['A' , 'B' , 'C' ], 1000 ),
'cluster' : np.repeat(range (50 ), 20 )
})
estimator = TaylorEstimator("mean" )
result = estimator.estimate(
y=df['income' ].values,
samp_weight=df['weight' ].values,
stratum=df['stratum' ].values,
psu=df['cluster' ].values,
remove_nan=True
)
print (f"Mean income: ${result.point_est:,.0 f} " )
print (f"SE: ${result.stderror:,.0 f} " )
print (f"95% CI: [${result.lower_ci:,.0 f} , ${result.upper_ci:,.0 f} ]" )
Example 2: Proportions by Demographics
from samplics.estimation import TaylorEstimator
prop_estimator = TaylorEstimator("proportion" )
for gender in df['gender' ].unique():
subset = df['gender' ] == gender
result = prop_estimator.estimate(
y=df.loc[subset, 'has_insurance' ].values,
samp_weight=df.loc[subset, 'weight' ].values,
stratum=df.loc[subset, 'stratum' ].values,
psu=df.loc[subset, 'cluster' ].values,
remove_nan=True
)
print (f"{gender} : {result.point_est*100 :.1 f} % ± {result.stderror*100 :.1 f} %" )
Example 3: Cross-Tabulation
results = {}
for race in df['race' ].unique():
results[race] = {}
for education in df['education' ].unique():
subset = (df['race' ] == race) & (df['education' ] == education)
if subset.sum () > 0 :
result = prop_estimator.estimate(
y=df.loc[subset, 'outcome' ].values,
samp_weight=df.loc[subset, 'weight' ].values,
stratum=df.loc[subset, 'stratum' ].values,
psu=df.loc[subset, 'cluster' ].values,
remove_nan=True
)
results[race][education] = {
'estimate' : result.point_est,
'se' : result.stderror
}
import pandas as pd
table = pd.DataFrame(results).T
print (table)
Pattern 1: Descriptive Analysis Pipeline
import pandas as pd
import numpy as np
from samplics.estimation import TaylorEstimator
def descriptive_analysis (df, variables, weight_var, stratum_var=None ,
cluster_var=None , by_var=None ):
"""
Complete descriptive analysis for survey data.
Parameters:
-----------
df : DataFrame
Survey data
variables : list of str
Variables to analyze
weight_var : str
Sampling weight variable
stratum_var : str, optional
Stratum variable
cluster_var : str, optional
Cluster/PSU variable
by_var : str, optional
Grouping variable
Returns:
--------
DataFrame with estimates and standard errors
"""
results = []
if by_var:
groups = df[by_var].unique()
else :
groups = [None ]
for var in variables:
for group in groups:
if group is not None :
subset = df[df[by_var] == group]
group_label = f"{by_var} ={group} "
else :
subset = df
group_label = "Overall"
if len (subset) == 0 :
continue
y = subset[var].values
weights = subset[weight_var].values
stratum = subset[stratum_var].values if stratum_var else None
psu = subset[cluster_var].values if cluster_var else None
est_mean = TaylorEstimator("mean" )
result_mean = est_mean.estimate(
y=y, samp_weight=weights,
stratum=stratum, psu=psu, remove_nan=True
)
est_median = TaylorEstimator("quantile" )
result_median = est_median.estimate(
y=y, samp_weight=weights,
stratum=stratum, psu=psu,
q=0.5 , remove_nan=True
)
est_total = TaylorEstimator("total" )
result_total = est_total.estimate(
y=y, samp_weight=weights,
stratum=stratum, psu=psu, remove_nan=True
)
results.append({
'Variable' : var,
'Group' : group_label,
'Mean' : result_mean.point_est,
'Mean_SE' : result_mean.stderror,
'Median' : result_median.point_est,
'Total' : result_total.point_est,
'Total_SE' : result_total.stderror,
'N' : len (y[~np.isnan(y)])
})
return pd.DataFrame(results)
results_df = descriptive_analysis(
df=survey_df,
variables=['income' , 'age' , 'bmi' ],
weight_var='weight' ,
stratum_var='stratum' ,
cluster_var='cluster' ,
by_var='gender'
)
print (results_df)
Pattern 2: Time Series Analysis
def trend_analysis (df, variable, time_var, weight_var,
stratum_var=None , cluster_var=None ):
"""
Analyze trends over time in survey data.
"""
results = []
for time_point in sorted (df[time_var].unique()):
subset = df[df[time_var] == time_point]
estimator = TaylorEstimator("mean" )
result = estimator.estimate(
y=subset[variable].values,
samp_weight=subset[weight_var].values,
stratum=subset[stratum_var].values if stratum_var else None ,
psu=subset[cluster_var].values if cluster_var else None ,
remove_nan=True
)
results.append({
'Time' : time_point,
'Estimate' : result.point_est,
'SE' : result.stderror,
'Lower_CI' : result.lower_ci,
'Upper_CI' : result.upper_ci
})
return pd.DataFrame(results)
trends = trend_analysis(
df=survey_df,
variable='obesity_rate' ,
time_var='year' ,
weight_var='weight' ,
stratum_var='stratum' ,
cluster_var='cluster'
)
import matplotlib.pyplot as plt
plt.figure(figsize=(10 , 6 ))
plt.plot(trends['Time' ], trends['Estimate' ], marker='o' , linewidth=2 )
plt.fill_between(
trends['Time' ],
trends['Lower_CI' ],
trends['Upper_CI' ],
alpha=0.3
)
plt.xlabel('Year' )
plt.ylabel('Obesity Rate' )
plt.title('Trend in Obesity Rate Over Time' )
plt.grid(True , alpha=0.3 )
plt.savefig('trend_plot.png' , dpi=300 , bbox_inches='tight' )
Pattern 3: Regression with Survey Weights
import statsmodels.api as sm
from statsmodels.regression.linear_model import WLS
def survey_regression (df, formula, weight_var, cluster_var=None ):
"""
Weighted regression accounting for survey design.
Parameters:
-----------
df : DataFrame
Survey data
formula : str
Regression formula (patsy style)
weight_var : str
Sampling weight variable
cluster_var : str, optional
Cluster variable for robust SE
Returns:
--------
Regression results
"""
model = sm.formula.wls(
formula=formula,
data=df,
weights=df[weight_var]
)
if cluster_var:
results = model.fit(cov_type='cluster' ,
cov_kwds={'groups' : df[cluster_var]})
else :
results = model.fit(cov_type='HC3' )
return results
model = survey_regression(
df=survey_df,
formula='income ~ age + education + gender' ,
weight_var='weight' ,
cluster_var='cluster'
)
print (model.summary())
try :
from marginaleffects import avg_slopes
ame = avg_slopes(model)
print ("\nAverage Marginal Effects:" )
print (ame)
except ImportError:
print ("\nInstall marginaleffects for marginal effects analysis" )
Data Preparation
✓ Always check for missing values:
df = df.dropna(subset=['variable' , 'weight' ])
✓ Convert to numpy arrays:
y = df['variable' ].values
✓ Ensure positive weights:
assert (df['weight' ] > 0 ).all (), "All weights must be positive"
Variance Estimation
✓ Always use proper design information:
result = estimator.estimate(
y=y, samp_weight=weights,
stratum=stratum, psu=psu
)
✓ Set remove_nan=True for automatic missing data handling:
result = estimator.estimate(..., remove_nan=True )
✓ Use appropriate confidence level:
result = estimator.estimate(..., alpha=0.05 )
Reporting
✓ Always report point estimates AND uncertainty:
print (f"{result.point_est:.2 f} ± {result.stderror:.2 f} " )
print (f"95% CI: [{result.lower_ci:.2 f} , {result.upper_ci:.2 f} ]" )
✓ Report design effects when relevant:
deff = result.stderror**2 / (result.point_est * (1 -result.point_est) / n)
print (f"Design Effect: {deff:.2 f} " )
✓ Include sample sizes:
print (f"N = {len (y[~np.isnan(y)])} " )
❌ Forgetting to use .values
result = estimator.estimate(y=df['var' ], ...)
result = estimator.estimate(y=df['var' ].values, ...)
❌ Ignoring survey design
result = estimator.estimate(y=y, samp_weight=weights)
result = estimator.estimate(
y=y, samp_weight=weights,
stratum=stratum, psu=psu
)
❌ Using unweighted statistics
mean = df['income' ].mean()
estimator = TaylorEstimator("mean" )
result = estimator.estimate(y=df['income' ].values,
samp_weight=df['weight' ].values)
❌ Not reporting uncertainty
print (f"Mean: {result.point_est} " )
print (f"Mean: {result.point_est:.2 f} (95% CI: [{result.lower_ci:.2 f} , {result.upper_ci:.2 f} ])" )
TaylorEstimator Parameters
Parameter Options Description parameter"mean", "proportion", "total", "ratio", "quantile" Quantity to estimate alpha0.05 (default) Significance level for CI remove_nanTrue/False Automatically handle missing single_psu"error", "adjust", "certainty" Single PSU handling
Estimation Methods
Method Purpose Example estimate()Point estimate + variance estimator.estimate(y, samp_weight)point_estAccess point estimate result.point_eststderrorAccess standard error result.stderrorlower_ciLower confidence limit result.lower_ciupper_ciUpper confidence limit result.upper_ci
Error Bar Plots
import matplotlib.pyplot as plt
def plot_estimates (results_df, x_col, y_col, se_col,
title="Survey Estimates" , figsize=(10 , 6 ) ):
"""
Create error bar plot for survey estimates.
"""
fig, ax = plt.subplots(figsize=figsize)
ax.errorbar(
results_df[x_col],
results_df[y_col],
yerr=1.96 * results_df[se_col],
marker='o' ,
linestyle='-' ,
linewidth=2 ,
capsize=5 ,
capthick=2
)
ax.set_xlabel(x_col, fontsize=12 )
ax.set_ylabel(y_col, fontsize=12 )
ax.set_title(title, fontsize=14 )
ax.grid(True , alpha=0.3 )
if len (results_df) > 5 :
plt.xticks(rotation=45 , ha='right' )
plt.tight_layout()
return fig
fig = plot_estimates(
results_df=trends,
x_col='Year' ,
y_col='Estimate' ,
se_col='SE' ,
title='Health Indicator Over Time'
)
plt.savefig('estimates_plot.png' , dpi=300 , bbox_inches='tight' )
Grouped Bar Plots
def plot_grouped_estimates (results_df, group_col, category_col,
estimate_col, se_col ):
"""
Create grouped bar plot with error bars.
"""
import numpy as np
categories = results_df[category_col].unique()
groups = results_df[group_col].unique()
x = np.arange(len (categories))
width = 0.8 / len (groups)
fig, ax = plt.subplots(figsize=(12 , 6 ))
for i, group in enumerate (groups):
subset = results_df[results_df[group_col] == group]
ax.bar(
x + i * width,
subset[estimate_col],
width,
yerr=1.96 * subset[se_col],
label=group,
capsize=5
)
ax.set_xlabel(category_col, fontsize=12 )
ax.set_ylabel(estimate_col, fontsize=12 )
ax.set_xticks(x + width * (len (groups) - 1 ) / 2 )
ax.set_xticklabels(categories, rotation=45 , ha='right' )
ax.legend()
ax.grid(True , alpha=0.3 , axis='y' )
plt.tight_layout()
return fig
import statsmodels.api as sm
from samplics.estimation import TaylorEstimator
model = sm.formula.wls(
'outcome ~ age + gender + education' ,
data=df,
weights=df['weight' ]
).fit(cov_type='cluster' , cov_kwds={'groups' : df['cluster' ]})
print (model.summary())
df['predicted' ] = model.predict(df)
pred_estimator = TaylorEstimator("mean" )
pred_result = pred_estimator.estimate(
y=df['predicted' ].values,
samp_weight=df['weight' ].values,
stratum=df['stratum' ].values,
psu=df['cluster' ].values
)
print (f"Mean predicted outcome: {pred_result.point_est:.2 f} " )
weighted_residuals = model.resid * np.sqrt(df['weight' ])
See detailed guides in references/:
Cause : Passing pandas Series instead of numpy arrays
Fix : Use .values when passing to estimator
y = df['variable' ].values
Issue: Large standard errors
Cause : Small sample size or high design effect
Fix :
Check sample size: print(len(y[~np.isnan(y)]))
Compute design effect to understand inflation
Consider collapsing categories or pooling data
Issue: "All values are NaN"
Cause : Missing data in variable
Fix :
print (df['variable' ].isna().sum ())
result = estimator.estimate(..., remove_nan=True )
Issue: "Single PSU in stratum"
Cause : Only one cluster in a stratum
Fix : Use single_psu adjustment
result = estimator.estimate(
...,
single_psu={"method" : "adjust" }
)
Main Estimators
from samplics.estimation import TaylorEstimator
est = TaylorEstimator(parameter)
result = est.estimate(
y,
samp_weight,
stratum=None ,
psu=None ,
ssu=None ,
remove_nan=True ,
alpha=0.05 ,
single_psu=None
)
result.point_est
result.stderror
result.lower_ci
result.upper_ci
Additional Functions
from samplics.estimation import ReplicateEstimator
rep_est = ReplicateEstimator(
parameter="mean" ,
method="bootstrap"
)
Survey sampling theory : Lohr (2019) "Sampling: Design and Analysis"
Complex surveys in practice : Lumley (2010) "Complex Surveys"
Scripts : Pre-built workflows in scripts/
Examples : Complete analyses in assets/
Run this to verify setup:
import samplics
import pandas as pd
import numpy as np
print (f"samplics version: {samplics.__version__} " )
print (f"pandas version: {pd.__version__} " )
print ("Setup complete!" )
Check function docstrings: help(TaylorEstimator)
Review examples: assets/nhanes_example.py
Consult references: references/descriptive-statistics.md
Search docs: https://samplics-org.github.io/samplics/