
samplics-survey-analysis

Academic-grade survey data analysis using samplics for complex sample designs. Use for computing weighted means, medians, and proportions with proper standard errors; categorical data analysis; tabulation by demographics and time; and integration with statsmodels for regression analysis. Supports stratified, clustered, and multi-stage sampling designs.

Survey Data Analysis with Samplics

Professional survey data analysis combining proper sampling design estimation with statistical modeling.

When to Use This Skill

  • Complex survey data with sampling weights, strata, and/or clusters
  • Descriptive statistics (means, medians, proportions, totals) with proper standard errors
  • Tabulation and cross-tabulation by demographics, time periods, or multiple factors
  • Categorical data analysis with weighted counts and proportions
  • Time series analysis of survey data (trends over time)
  • Subgroup analysis with proper variance estimation
  • Regression modeling that accounts for survey design
  • Publication-ready tables and visualizations

Prerequisites

Required packages:

```bash
pip install samplics pandas numpy matplotlib --break-system-packages

# For regression integration:
pip install statsmodels --break-system-packages
```

Python version: 3.8+

Data requirements (a minimal example of the expected layout follows this list):

  • Survey weight variable (required)
  • Stratum variable (optional, for stratified designs)
  • Cluster/PSU variable (optional, for clustered designs)
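
For orientation, a minimal sketch of a data frame that meets these requirements. The column names here are placeholders for illustration, not a required schema:

```python
import pandas as pd

# Hypothetical survey extract; rename columns to match your data.
df = pd.DataFrame({
    "income":  [42_000, 58_500, 31_200, 75_000],  # analysis variable
    "weight":  [1.8, 0.9, 2.3, 1.1],              # sampling weight (required)
    "stratum": ["A", "A", "B", "B"],              # stratum (optional)
    "cluster": [101, 101, 202, 203],              # PSU/cluster (optional)
})
```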

Bundled Resources

References (references/)

Load these when implementing survey analyses or when you need detailed guidance:

  • descriptive-statistics.md - Comprehensive guide to means, medians, proportions, totals
  • tabulation-guide.md - Cross-tabulation and categorical data analysis patterns
  • regression-integration.md - Integrating samplics with statsmodels for weighted regression
  • design-effects.md - Understanding and computing design effects (DEFF, DEFT)

Scripts (scripts/)

Executable utilities for common survey analysis tasks:

  • descriptive_pipeline.py - Complete descriptive analysis workflow
  • tabulation_pipeline.py - Automated tabulation and cross-tabulation
  • trend_analysis.py - Time series analysis of survey data

Assets (assets/)

Production-ready analysis templates and examples:

  • nhanes_example.py - Complete analysis using simulated NHANES-style data
  • demographic_analysis.py - Subgroup analysis template
  • trend_visualization.py - Time series plotting patterns

Core Workflow

```python
# 1. Install packages (run in a shell):
#    pip install samplics pandas --break-system-packages

# 2. Import and prepare data
import pandas as pd
from samplics.estimation import TaylorEstimator

df = pd.read_csv("survey_data.csv")

# 3. Create estimator with design
estimator = TaylorEstimator("mean")

# 4. Compute estimates with proper SE
estimate = estimator.estimate(
    y=df["variable"].values,
    samp_weight=df["weight"].values,
    stratum=df["stratum"].values,
    psu=df["cluster"].values,
    remove_nan=True
)

# 5. Access results
print(f"Estimate: {estimate.point_est}")
print(f"SE: {estimate.stderror}")
print(f"95% CI: [{estimate.lower_ci}, {estimate.upper_ci}]")
```

Quick Start Examples

Example 1: Simple Mean with Standard Error

```python
import pandas as pd
import numpy as np
from samplics.estimation import TaylorEstimator

# Sample data
df = pd.DataFrame({
    'income': np.random.lognormal(10, 1, 1000),
    'weight': np.random.uniform(0.5, 2, 1000),
    'stratum': np.random.choice(['A', 'B', 'C'], 1000),
    'cluster': np.repeat(range(50), 20)
})

# Create estimator
estimator = TaylorEstimator("mean")

# Compute weighted mean with design-based SE
result = estimator.estimate(
    y=df['income'].values,
    samp_weight=df['weight'].values,
    stratum=df['stratum'].values,
    psu=df['cluster'].values,
    remove_nan=True
)

print(f"Mean income: ${result.point_est:,.0f}")
print(f"SE: ${result.stderror:,.0f}")
print(f"95% CI: [${result.lower_ci:,.0f}, ${result.upper_ci:,.0f}]")
```

Example 2: Proportions by Demographics

```python
from samplics.estimation import TaylorEstimator

# Proportion estimator
prop_estimator = TaylorEstimator("proportion")

# Compute proportion by gender
for gender in df['gender'].unique():
    subset = df['gender'] == gender
    result = prop_estimator.estimate(
        y=df.loc[subset, 'has_insurance'].values,
        samp_weight=df.loc[subset, 'weight'].values,
        stratum=df.loc[subset, 'stratum'].values,
        psu=df.loc[subset, 'cluster'].values,
        remove_nan=True
    )
    print(f"{gender}: {result.point_est*100:.1f}% ± {result.stderror*100:.1f}%")
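```

Note that subsetting rows before estimating discards the part of the design that falls outside the subgroup, which can understate variance. If your samplics version supports it, estimate() also accepts a domain argument that produces subgroup estimates while retaining the full design. A minimal sketch, assuming the same columns as above:

```python
# Hedged sketch: domain (subpopulation) estimation in one call.
# Assumes df has 'has_insurance' and 'gender' columns as above.
result = prop_estimator.estimate(
    y=df['has_insurance'].values,
    samp_weight=df['weight'].values,
    stratum=df['stratum'].values,
    psu=df['cluster'].values,
    domain=df['gender'].values,  # one estimate per gender, full-design variance
    remove_nan=True
)
print(result.point_est)  # dict-like: one entry per domain level
```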

Example 3: Cross-Tabulation

```python
# Tabulate by two factors
results = {}
for race in df['race'].unique():
    results[race] = {}
    for education in df['education'].unique():
        subset = (df['race'] == race) & (df['education'] == education)
        if subset.sum() > 0:
            result = prop_estimator.estimate(
                y=df.loc[subset, 'outcome'].values,
                samp_weight=df.loc[subset, 'weight'].values,
                stratum=df.loc[subset, 'stratum'].values,
                psu=df.loc[subset, 'cluster'].values,
                remove_nan=True
            )
            results[race][education] = {
                'estimate': result.point_est,
                'se': result.stderror
            }

# Convert to DataFrame for display
import pandas as pd
table = pd.DataFrame(results).T
print(table)
```

Key Patterns

Pattern 1: Descriptive Analysis Pipeline

```python
import pandas as pd
import numpy as np
from samplics.estimation import TaylorEstimator

def descriptive_analysis(df, variables, weight_var, stratum_var=None,
                         cluster_var=None, by_var=None):
    """
    Complete descriptive analysis for survey data.

    Parameters
    ----------
    df : DataFrame
        Survey data
    variables : list of str
        Variables to analyze
    weight_var : str
        Sampling weight variable
    stratum_var : str, optional
        Stratum variable
    cluster_var : str, optional
        Cluster/PSU variable
    by_var : str, optional
        Grouping variable

    Returns
    -------
    DataFrame with estimates and standard errors
    """
    results = []

    # Determine groups
    if by_var:
        groups = df[by_var].unique()
    else:
        groups = [None]

    # Loop through variables and groups
    for var in variables:
        for group in groups:
            # Subset data
            if group is not None:
                subset = df[df[by_var] == group]
                group_label = f"{by_var}={group}"
            else:
                subset = df
                group_label = "Overall"

            # Skip if empty
            if len(subset) == 0:
                continue

            # Prepare arrays
            y = subset[var].values
            weights = subset[weight_var].values
            stratum = subset[stratum_var].values if stratum_var else None
            psu = subset[cluster_var].values if cluster_var else None

            # Mean
            est_mean = TaylorEstimator("mean")
            result_mean = est_mean.estimate(
                y=y, samp_weight=weights, stratum=stratum, psu=psu,
                remove_nan=True
            )

            # Median (using quantile estimator)
            est_median = TaylorEstimator("quantile")
            result_median = est_median.estimate(
                y=y, samp_weight=weights, stratum=stratum, psu=psu,
                q=0.5, remove_nan=True
            )

            # Total
            est_total = TaylorEstimator("total")
            result_total = est_total.estimate(
                y=y, samp_weight=weights, stratum=stratum, psu=psu,
                remove_nan=True
            )

            results.append({
                'Variable': var,
                'Group': group_label,
                'Mean': result_mean.point_est,
                'Mean_SE': result_mean.stderror,
                'Median': result_median.point_est,
                'Total': result_total.point_est,
                'Total_SE': result_total.stderror,
                'N': len(y[~np.isnan(y)])
            })

    return pd.DataFrame(results)

# Use it
results_df = descriptive_analysis(
    df=survey_df,
    variables=['income', 'age', 'bmi'],
    weight_var='weight',
    stratum_var='stratum',
    cluster_var='cluster',
    by_var='gender'
)
print(results_df)
```

Pattern 2: Time Series Analysis

```python
import pandas as pd
import matplotlib.pyplot as plt
from samplics.estimation import TaylorEstimator

def trend_analysis(df, variable, time_var, weight_var,
                   stratum_var=None, cluster_var=None):
    """
    Analyze trends over time in survey data.
    """
    results = []
    for time_point in sorted(df[time_var].unique()):
        subset = df[df[time_var] == time_point]
        estimator = TaylorEstimator("mean")
        result = estimator.estimate(
            y=subset[variable].values,
            samp_weight=subset[weight_var].values,
            stratum=subset[stratum_var].values if stratum_var else None,
            psu=subset[cluster_var].values if cluster_var else None,
            remove_nan=True
        )
        results.append({
            'Time': time_point,
            'Estimate': result.point_est,
            'SE': result.stderror,
            'Lower_CI': result.lower_ci,
            'Upper_CI': result.upper_ci
        })
    return pd.DataFrame(results)

# Use it
trends = trend_analysis(
    df=survey_df,
    variable='obesity_rate',
    time_var='year',
    weight_var='weight',
    stratum_var='stratum',
    cluster_var='cluster'
)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(trends['Time'], trends['Estimate'], marker='o', linewidth=2)
plt.fill_between(
    trends['Time'], trends['Lower_CI'], trends['Upper_CI'], alpha=0.3
)
plt.xlabel('Year')
plt.ylabel('Obesity Rate')
plt.title('Trend in Obesity Rate Over Time')
plt.grid(True, alpha=0.3)
plt.savefig('trend_plot.png', dpi=300, bbox_inches='tight')
```

Pattern 3: Regression with Survey Weights

```python
import statsmodels.api as sm

def survey_regression(df, formula, weight_var, cluster_var=None):
    """
    Weighted regression accounting for survey design.

    Parameters
    ----------
    df : DataFrame
        Survey data
    formula : str
        Regression formula (patsy style)
    weight_var : str
        Sampling weight variable
    cluster_var : str, optional
        Cluster variable for robust SE

    Returns
    -------
    Regression results
    """
    # Fit weighted regression
    model = sm.formula.wls(
        formula=formula,
        data=df,
        weights=df[weight_var]
    )

    # Use robust SE if clusters provided
    if cluster_var:
        results = model.fit(cov_type='cluster',
                            cov_kwds={'groups': df[cluster_var]})
    else:
        results = model.fit(cov_type='HC3')

    return results

# Use it
model = survey_regression(
    df=survey_df,
    formula='income ~ age + education + gender',
    weight_var='weight',
    cluster_var='cluster'
)
print(model.summary())

# Marginal effects (if marginaleffects installed)
try:
    from marginaleffects import avg_slopes
    ame = avg_slopes(model)
    print("\nAverage Marginal Effects:")
    print(ame)
except ImportError:
    print("\nInstall marginaleffects for marginal effects analysis")
```

Critical Best Practices

Data Preparation

✓ Always check for missing values:

```python
df = df.dropna(subset=['variable', 'weight'])
```

✓ Convert to numpy arrays:

```python
y = df['variable'].values  # Not df['variable']
```

✓ Ensure positive weights:

```python
assert (df['weight'] > 0).all(), "All weights must be positive"
```

Variance Estimation

✓ Always use proper design information:

```python
# Include stratum and cluster when available
result = estimator.estimate(
    y=y,
    samp_weight=weights,
    stratum=stratum,
    psu=psu  # Don't omit these!
)
```

✓ Set remove_nan=True for automatic missing data handling:

```python
result = estimator.estimate(..., remove_nan=True)
```

✓ Use appropriate confidence level:

```python
result = estimator.estimate(..., alpha=0.05)  # Default is 95% CI
```

Reporting

✓ Always report point estimates AND uncertainty:

```python
print(f"{result.point_est:.2f} ± {result.stderror:.2f}")
print(f"95% CI: [{result.lower_ci:.2f}, {result.upper_ci:.2f}]")
```

✓ Report design effects when relevant:

```python
# SRS variance of a proportion is p*(1-p)/n, so this DEFF applies to proportions
deff = result.stderror**2 / (result.point_est * (1 - result.point_est) / n)
print(f"Design Effect: {deff:.2f}")
```

✓ Include sample sizes:

```python
print(f"N = {len(y[~np.isnan(y)])}")
```

Common Pitfalls

Forgetting to use .values

```python
# Bad - passes Series
result = estimator.estimate(y=df['var'], ...)

# Good - passes array
result = estimator.estimate(y=df['var'].values, ...)
```

Ignoring survey design

```python
# Bad - no design info
result = estimator.estimate(y=y, samp_weight=weights)

# Good - includes stratum and cluster
result = estimator.estimate(
    y=y,
    samp_weight=weights,
    stratum=stratum,
    psu=psu
)
```

Using unweighted statistics

```python
# Bad - simple mean
mean = df['income'].mean()

# Good - weighted with proper SE
estimator = TaylorEstimator("mean")
result = estimator.estimate(y=df['income'].values,
                            samp_weight=df['weight'].values)
```

Not reporting uncertainty

```python
# Bad - point estimate only
print(f"Mean: {result.point_est}")

# Good - with confidence interval
print(f"Mean: {result.point_est:.2f} "
      f"(95% CI: [{result.lower_ci:.2f}, {result.upper_ci:.2f}])")
```

Available Estimators

TaylorEstimator Parameters

| Parameter | Options | Description |
|---|---|---|
| `parameter` | `"mean"`, `"proportion"`, `"total"`, `"ratio"`, `"quantile"` | Quantity to estimate |
| `alpha` | `0.05` (default) | Significance level for the CI |
| `remove_nan` | `True`/`False` | Automatically handle missing values |
| `single_psu` | `"error"`, `"adjust"`, `"certainty"` | Single-PSU handling |
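
Of these, "ratio" is the only parameter not demonstrated elsewhere in this skill. A minimal sketch, assuming the denominator is passed via an `x` argument to estimate() — verify the argument name against your samplics version:

```python
from samplics.estimation import TaylorEstimator

# Hedged sketch: ratio of two totals, e.g. spending per dollar of income.
# 'medical_spending' and 'income' are illustrative column names.
ratio_est = TaylorEstimator("ratio")
result = ratio_est.estimate(
    y=df['medical_spending'].values,   # numerator
    x=df['income'].values,             # denominator (assumed argument name)
    samp_weight=df['weight'].values,
    stratum=df['stratum'].values,
    psu=df['cluster'].values,
    remove_nan=True
)
print(f"Ratio: {result.point_est:.3f} (SE {result.stderror:.3f})")
```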

Estimation Methods

| Method / Attribute | Purpose | Example |
|---|---|---|
| `estimate()` | Point estimate + variance | `estimator.estimate(y, samp_weight)` |
| `point_est` | Access point estimate | `result.point_est` |
| `stderror` | Access standard error | `result.stderror` |
| `lower_ci` | Lower confidence limit | `result.lower_ci` |
| `upper_ci` | Upper confidence limit | `result.upper_ci` |

Visualization Patterns

Error Bar Plots

```python
import matplotlib.pyplot as plt

def plot_estimates(results_df, x_col, y_col, se_col,
                   title="Survey Estimates", figsize=(10, 6)):
    """
    Create error bar plot for survey estimates.
    """
    fig, ax = plt.subplots(figsize=figsize)
    ax.errorbar(
        results_df[x_col],
        results_df[y_col],
        yerr=1.96 * results_df[se_col],  # 95% CI
        marker='o',
        linestyle='-',
        linewidth=2,
        capsize=5,
        capthick=2
    )
    ax.set_xlabel(x_col, fontsize=12)
    ax.set_ylabel(y_col, fontsize=12)
    ax.set_title(title, fontsize=14)
    ax.grid(True, alpha=0.3)

    # Rotate x-labels if many categories
    if len(results_df) > 5:
        plt.xticks(rotation=45, ha='right')

    plt.tight_layout()
    return fig

# Use it (trends stores the time variable in its 'Time' column)
fig = plot_estimates(
    results_df=trends,
    x_col='Time',
    y_col='Estimate',
    se_col='SE',
    title='Health Indicator Over Time'
)
plt.savefig('estimates_plot.png', dpi=300, bbox_inches='tight')
```

Grouped Bar Plots

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_grouped_estimates(results_df, group_col, category_col,
                           estimate_col, se_col):
    """
    Create grouped bar plot with error bars.
    """
    categories = results_df[category_col].unique()
    groups = results_df[group_col].unique()

    x = np.arange(len(categories))
    width = 0.8 / len(groups)

    fig, ax = plt.subplots(figsize=(12, 6))

    for i, group in enumerate(groups):
        subset = results_df[results_df[group_col] == group]
        ax.bar(
            x + i * width,
            subset[estimate_col],
            width,
            yerr=1.96 * subset[se_col],
            label=group,
            capsize=5
        )

    ax.set_xlabel(category_col, fontsize=12)
    ax.set_ylabel(estimate_col, fontsize=12)
    ax.set_xticks(x + width * (len(groups) - 1) / 2)
    ax.set_xticklabels(categories, rotation=45, ha='right')
    ax.legend()
    ax.grid(True, alpha=0.3, axis='y')

    plt.tight_layout()
    return fig
```

Integration with Statsmodels

```python
import numpy as np
import statsmodels.api as sm
from samplics.estimation import TaylorEstimator

# 1. Fit weighted regression
model = sm.formula.wls(
    'outcome ~ age + gender + education',
    data=df,
    weights=df['weight']
).fit(cov_type='cluster', cov_kwds={'groups': df['cluster']})
print(model.summary())

# 2. Predicted values accounting for weights
df['predicted'] = model.predict(df)

# 3. Compute weighted mean of predictions
pred_estimator = TaylorEstimator("mean")
pred_result = pred_estimator.estimate(
    y=df['predicted'].values,
    samp_weight=df['weight'].values,
    stratum=df['stratum'].values,
    psu=df['cluster'].values
)
print(f"Mean predicted outcome: {pred_result.point_est:.2f}")

# 4. Model diagnostics with weights
weighted_residuals = model.resid * np.sqrt(df['weight'])
```

Advanced Topics

See the detailed guides in references/: descriptive-statistics.md, tabulation-guide.md, regression-integration.md, and design-effects.md.

Troubleshooting

Issue: "ValueError: could not broadcast input array"

Cause: Passing pandas Series instead of numpy arrays
Fix: Use .values when passing to estimator

```python
y = df['variable'].values  # Add .values
```

Issue: Large standard errors

Cause: Small sample size or high design effect
Fix:

  1. Check the sample size: print(len(y[~np.isnan(y)]))
  2. Compute the design effect to understand the variance inflation (a sketch follows this list)
  3. Consider collapsing categories or pooling data
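
A minimal sketch of step 2 for a mean, comparing the design-based variance to an approximate simple-random-sampling variance (the unweighted s²/n here is an approximation):

```python
import numpy as np

# DEFF = design-based variance / SRS variance.
# `y` and `result` are the array and TaylorEstimator result from above.
y_clean = y[~np.isnan(y)]
n = len(y_clean)
srs_var = np.var(y_clean, ddof=1) / n   # approximate SRS variance of the mean
deff = result.stderror**2 / srs_var     # DEFF >> 1 explains large SEs
print(f"n = {n}, DEFF = {deff:.2f}")
```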

Issue: "All values are NaN"

Cause: Missing data in variable
Fix:

```python
# Check missing data
print(df['variable'].isna().sum())

# Use remove_nan=True
result = estimator.estimate(..., remove_nan=True)
```

Issue: "Single PSU in stratum"

Cause: Only one cluster in a stratum
Fix: Use single_psu adjustment

```python
result = estimator.estimate(
    ...,
    single_psu="adjust"  # one of the options in the parameters table above
)
```

Function Reference

Main Estimators

```python
from samplics.estimation import TaylorEstimator

# Create estimator
est = TaylorEstimator(parameter)
# parameter: "mean", "proportion", "total", "ratio", "quantile"

# Estimate
result = est.estimate(
    y,                  # Outcome variable (array)
    samp_weight,        # Sampling weights (array)
    stratum=None,       # Stratum IDs (array, optional)
    psu=None,           # Cluster IDs (array, optional)
    ssu=None,           # Secondary sampling units (array, optional)
    remove_nan=True,    # Remove missing values
    alpha=0.05,         # Significance level
    single_psu=None     # Single PSU handling
)

# Access results
result.point_est   # Point estimate
result.stderror    # Standard error
result.lower_ci    # Lower CI bound
result.upper_ci    # Upper CI bound
```

Additional Functions

```python
# Replicate weights (for alternative variance estimation)
from samplics.estimation import ReplicateEstimator

rep_est = ReplicateEstimator(
    parameter="mean",
    method="bootstrap"  # or "jackknife", "BRR"
)
```

Resources

  • Survey sampling theory: Lohr (2019) "Sampling: Design and Analysis"
  • Complex surveys in practice: Lumley (2010) "Complex Surveys"
  • Scripts: Pre-built workflows in scripts/
  • Examples: Complete analyses in assets/

Installation Check

Run this to verify setup:

```python
import samplics
import pandas as pd
import numpy as np

print(f"samplics version: {samplics.__version__}")
print(f"pandas version: {pd.__version__}")
print("Setup complete!")
```

Getting Help

  1. Check function docstrings: help(TaylorEstimator)
  2. Review examples: assets/nhanes_example.py
  3. Consult references: references/descriptive-statistics.md
  4. Search docs: https://samplics-org.github.io/samplics/
