
samplics-survey-analysis

Academic-grade survey data analysis using samplics for complex sample designs. Use for computing weighted means, medians, and proportions with proper standard errors; categorical data analysis; tabulation by demographics and time; and integration with statsmodels for regression analysis. Supports stratified, clustered, and multi-stage sampling designs.

Survey Data Analysis with Samplics

Professional survey data analysis combining proper sampling design estimation with statistical modeling.

When to Use This Skill

  • Complex survey data with sampling weights, strata, and/or clusters
  • Descriptive statistics (means, medians, proportions, totals) with proper standard errors
  • Tabulation and cross-tabulation by demographics, time periods, or multiple factors
  • Categorical data analysis with weighted counts and proportions
  • Time series analysis of survey data (trends over time)
  • Subgroup analysis with proper variance estimation
  • Regression modeling that accounts for survey design
  • Publication-ready tables and visualizations

Prerequisites

Required packages:

```bash
pip install samplics pandas numpy matplotlib --break-system-packages

# For regression integration:
pip install statsmodels --break-system-packages
```

Python version: 3.8+

Data requirements (a minimal example of the expected layout follows this list):

  • Survey weight variable (required)
  • Stratum variable (optional, for stratified designs)
  • Cluster/PSU variable (optional, for clustered designs)
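
For orientation, a minimal sketch of a data frame that meets these requirements. The column names here are placeholders for illustration, not a required schema:

```python
import pandas as pd

# Hypothetical survey extract; rename columns to match your data.
df = pd.DataFrame({
    "income":  [42_000, 58_500, 31_200, 75_000],  # analysis variable
    "weight":  [1.8, 0.9, 2.3, 1.1],              # sampling weight (required)
    "stratum": ["A", "A", "B", "B"],              # stratum (optional)
    "cluster": [101, 101, 202, 203],              # PSU/cluster (optional)
})
```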

Bundled Resources

References (references/)

Load these when implementing survey analyses or when you need detailed guidance:

  • descriptive-statistics.md - Comprehensive guide to means, medians, proportions, totals
  • tabulation-guide.md - Cross-tabulation and categorical data analysis patterns
  • regression-integration.md - Integrating samplics with statsmodels for weighted regression
  • design-effects.md - Understanding and computing design effects (DEFF, DEFT)

Scripts (scripts/)

Executable utilities for common survey analysis tasks:

  • descriptive_pipeline.py - Complete descriptive analysis workflow
  • tabulation_pipeline.py - Automated tabulation and cross-tabulation
  • trend_analysis.py - Time series analysis of survey data

Assets (assets/)

Production-ready analysis templates and examples:

  • nhanes_example.py - Complete analysis using simulated NHANES-style data
  • demographic_analysis.py - Subgroup analysis template
  • trend_visualization.py - Time series plotting patterns

Core Workflow

```python
# 1. Install packages (run in a shell):
#    pip install samplics pandas --break-system-packages

# 2. Import and prepare data
import pandas as pd
from samplics.estimation import TaylorEstimator

df = pd.read_csv("survey_data.csv")

# 3. Create estimator with design
estimator = TaylorEstimator("mean")

# 4. Compute estimates with proper SE
estimate = estimator.estimate(
    y=df["variable"].values,
    samp_weight=df["weight"].values,
    stratum=df["stratum"].values,
    psu=df["cluster"].values,
    remove_nan=True
)

# 5. Access results
print(f"Estimate: {estimate.point_est}")
print(f"SE: {estimate.stderror}")
print(f"95% CI: [{estimate.lower_ci}, {estimate.upper_ci}]")
```

Quick Start Examples

Example 1: Simple Mean with Standard Error

```python
import pandas as pd
import numpy as np
from samplics.estimation import TaylorEstimator

# Sample data
df = pd.DataFrame({
    'income': np.random.lognormal(10, 1, 1000),
    'weight': np.random.uniform(0.5, 2, 1000),
    'stratum': np.random.choice(['A', 'B', 'C'], 1000),
    'cluster': np.repeat(range(50), 20)
})

# Create estimator
estimator = TaylorEstimator("mean")

# Compute weighted mean with design-based SE
result = estimator.estimate(
    y=df['income'].values,
    samp_weight=df['weight'].values,
    stratum=df['stratum'].values,
    psu=df['cluster'].values,
    remove_nan=True
)

print(f"Mean income: ${result.point_est:,.0f}")
print(f"SE: ${result.stderror:,.0f}")
print(f"95% CI: [${result.lower_ci:,.0f}, ${result.upper_ci:,.0f}]")
```

Example 2: Proportions by Demographics

```python
from samplics.estimation import TaylorEstimator

# Proportion estimator
prop_estimator = TaylorEstimator("proportion")

# Compute proportion by gender
for gender in df['gender'].unique():
    subset = df['gender'] == gender
    result = prop_estimator.estimate(
        y=df.loc[subset, 'has_insurance'].values,
        samp_weight=df.loc[subset, 'weight'].values,
        stratum=df.loc[subset, 'stratum'].values,
        psu=df.loc[subset, 'cluster'].values,
        remove_nan=True
    )
    print(f"{gender}: {result.point_est*100:.1f}% ± {result.stderror*100:.1f}%")
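```

Note that subsetting rows before estimating discards the part of the design that falls outside the subgroup, which can understate variance. If your samplics version supports it, estimate() also accepts a domain argument that produces subgroup estimates while retaining the full design. A minimal sketch, assuming the same columns as above:

```python
# Hedged sketch: domain (subpopulation) estimation in one call.
# Assumes df has 'has_insurance' and 'gender' columns as above.
result = prop_estimator.estimate(
    y=df['has_insurance'].values,
    samp_weight=df['weight'].values,
    stratum=df['stratum'].values,
    psu=df['cluster'].values,
    domain=df['gender'].values,  # one estimate per gender, full-design variance
    remove_nan=True
)
print(result.point_est)  # dict-like: one entry per domain level
```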

Example 3: Cross-Tabulation

```python
# Tabulate by two factors
results = {}
for race in df['race'].unique():
    results[race] = {}
    for education in df['education'].unique():
        subset = (df['race'] == race) & (df['education'] == education)
        if subset.sum() > 0:
            result = prop_estimator.estimate(
                y=df.loc[subset, 'outcome'].values,
                samp_weight=df.loc[subset, 'weight'].values,
                stratum=df.loc[subset, 'stratum'].values,
                psu=df.loc[subset, 'cluster'].values,
                remove_nan=True
            )
            results[race][education] = {
                'estimate': result.point_est,
                'se': result.stderror
            }

# Convert to DataFrame for display
import pandas as pd
table = pd.DataFrame(results).T
print(table)
```

Key Patterns

Pattern 1: Descriptive Analysis Pipeline

```python
import pandas as pd
import numpy as np
from samplics.estimation import TaylorEstimator

def descriptive_analysis(df, variables, weight_var, stratum_var=None,
                         cluster_var=None, by_var=None):
    """
    Complete descriptive analysis for survey data.

    Parameters
    ----------
    df : DataFrame
        Survey data
    variables : list of str
        Variables to analyze
    weight_var : str
        Sampling weight variable
    stratum_var : str, optional
        Stratum variable
    cluster_var : str, optional
        Cluster/PSU variable
    by_var : str, optional
        Grouping variable

    Returns
    -------
    DataFrame with estimates and standard errors
    """
    results = []

    # Determine groups
    if by_var:
        groups = df[by_var].unique()
    else:
        groups = [None]

    # Loop through variables and groups
    for var in variables:
        for group in groups:
            # Subset data
            if group is not None:
                subset = df[df[by_var] == group]
                group_label = f"{by_var}={group}"
            else:
                subset = df
                group_label = "Overall"

            # Skip if empty
            if len(subset) == 0:
                continue

            # Prepare arrays
            y = subset[var].values
            weights = subset[weight_var].values
            stratum = subset[stratum_var].values if stratum_var else None
            psu = subset[cluster_var].values if cluster_var else None

            # Mean
            est_mean = TaylorEstimator("mean")
            result_mean = est_mean.estimate(
                y=y, samp_weight=weights, stratum=stratum, psu=psu,
                remove_nan=True
            )

            # Median (using quantile estimator)
            est_median = TaylorEstimator("quantile")
            result_median = est_median.estimate(
                y=y, samp_weight=weights, stratum=stratum, psu=psu,
                q=0.5, remove_nan=True
            )

            # Total
            est_total = TaylorEstimator("total")
            result_total = est_total.estimate(
                y=y, samp_weight=weights, stratum=stratum, psu=psu,
                remove_nan=True
            )

            results.append({
                'Variable': var,
                'Group': group_label,
                'Mean': result_mean.point_est,
                'Mean_SE': result_mean.stderror,
                'Median': result_median.point_est,
                'Total': result_total.point_est,
                'Total_SE': result_total.stderror,
                'N': len(y[~np.isnan(y)])
            })

    return pd.DataFrame(results)

# Use it
results_df = descriptive_analysis(
    df=survey_df,
    variables=['income', 'age', 'bmi'],
    weight_var='weight',
    stratum_var='stratum',
    cluster_var='cluster',
    by_var='gender'
)
print(results_df)
```

Pattern 2: Time Series Analysis

```python
import pandas as pd
import matplotlib.pyplot as plt
from samplics.estimation import TaylorEstimator

def trend_analysis(df, variable, time_var, weight_var,
                   stratum_var=None, cluster_var=None):
    """
    Analyze trends over time in survey data.
    """
    results = []
    for time_point in sorted(df[time_var].unique()):
        subset = df[df[time_var] == time_point]
        estimator = TaylorEstimator("mean")
        result = estimator.estimate(
            y=subset[variable].values,
            samp_weight=subset[weight_var].values,
            stratum=subset[stratum_var].values if stratum_var else None,
            psu=subset[cluster_var].values if cluster_var else None,
            remove_nan=True
        )
        results.append({
            'Time': time_point,
            'Estimate': result.point_est,
            'SE': result.stderror,
            'Lower_CI': result.lower_ci,
            'Upper_CI': result.upper_ci
        })
    return pd.DataFrame(results)

# Use it
trends = trend_analysis(
    df=survey_df,
    variable='obesity_rate',
    time_var='year',
    weight_var='weight',
    stratum_var='stratum',
    cluster_var='cluster'
)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(trends['Time'], trends['Estimate'], marker='o', linewidth=2)
plt.fill_between(
    trends['Time'], trends['Lower_CI'], trends['Upper_CI'], alpha=0.3
)
plt.xlabel('Year')
plt.ylabel('Obesity Rate')
plt.title('Trend in Obesity Rate Over Time')
plt.grid(True, alpha=0.3)
plt.savefig('trend_plot.png', dpi=300, bbox_inches='tight')
```

Pattern 3: Regression with Survey Weights

```python
import statsmodels.api as sm

def survey_regression(df, formula, weight_var, cluster_var=None):
    """
    Weighted regression accounting for survey design.

    Parameters
    ----------
    df : DataFrame
        Survey data
    formula : str
        Regression formula (patsy style)
    weight_var : str
        Sampling weight variable
    cluster_var : str, optional
        Cluster variable for robust SE

    Returns
    -------
    Regression results
    """
    # Fit weighted regression
    model = sm.formula.wls(
        formula=formula,
        data=df,
        weights=df[weight_var]
    )

    # Use robust SE if clusters provided
    if cluster_var:
        results = model.fit(cov_type='cluster',
                            cov_kwds={'groups': df[cluster_var]})
    else:
        results = model.fit(cov_type='HC3')

    return results

# Use it
model = survey_regression(
    df=survey_df,
    formula='income ~ age + education + gender',
    weight_var='weight',
    cluster_var='cluster'
)
print(model.summary())

# Marginal effects (if marginaleffects installed)
try:
    from marginaleffects import avg_slopes
    ame = avg_slopes(model)
    print("\nAverage Marginal Effects:")
    print(ame)
except ImportError:
    print("\nInstall marginaleffects for marginal effects analysis")
```

Critical Best Practices

Data Preparation

✓ Always check for missing values:

```python
df = df.dropna(subset=['variable', 'weight'])
```

✓ Convert to numpy arrays:

```python
y = df['variable'].values  # Not df['variable']
```

✓ Ensure positive weights:

```python
assert (df['weight'] > 0).all(), "All weights must be positive"
```

Variance Estimation

✓ Always use proper design information:

```python
# Include stratum and cluster when available
result = estimator.estimate(
    y=y,
    samp_weight=weights,
    stratum=stratum,
    psu=psu  # Don't omit these!
)
```

✓ Set remove_nan=True for automatic missing data handling:

```python
result = estimator.estimate(..., remove_nan=True)
```

✓ Use appropriate confidence level:

```python
result = estimator.estimate(..., alpha=0.05)  # Default is 95% CI
```

Reporting

✓ Always report point estimates AND uncertainty:

```python
print(f"{result.point_est:.2f} ± {result.stderror:.2f}")
print(f"95% CI: [{result.lower_ci:.2f}, {result.upper_ci:.2f}]")
```

✓ Report design effects when relevant:

```python
# SRS variance of a proportion is p*(1-p)/n, so this DEFF applies to proportions
deff = result.stderror**2 / (result.point_est * (1 - result.point_est) / n)
print(f"Design Effect: {deff:.2f}")
```

✓ Include sample sizes:

```python
print(f"N = {len(y[~np.isnan(y)])}")
```

Common Pitfalls

Forgetting to use .values

```python
# Bad - passes Series
result = estimator.estimate(y=df['var'], ...)

# Good - passes array
result = estimator.estimate(y=df['var'].values, ...)
```

Ignoring survey design

```python
# Bad - no design info
result = estimator.estimate(y=y, samp_weight=weights)

# Good - includes stratum and cluster
result = estimator.estimate(
    y=y,
    samp_weight=weights,
    stratum=stratum,
    psu=psu
)
```

Using unweighted statistics

```python
# Bad - simple mean
mean = df['income'].mean()

# Good - weighted with proper SE
estimator = TaylorEstimator("mean")
result = estimator.estimate(y=df['income'].values,
                            samp_weight=df['weight'].values)
```

Not reporting uncertainty

```python
# Bad - point estimate only
print(f"Mean: {result.point_est}")

# Good - with confidence interval
print(f"Mean: {result.point_est:.2f} "
      f"(95% CI: [{result.lower_ci:.2f}, {result.upper_ci:.2f}])")
```

Available Estimators

TaylorEstimator Parameters

| Parameter | Options | Description |
|---|---|---|
| `parameter` | `"mean"`, `"proportion"`, `"total"`, `"ratio"`, `"quantile"` | Quantity to estimate |
| `alpha` | `0.05` (default) | Significance level for the CI |
| `remove_nan` | `True`/`False` | Automatically handle missing values |
| `single_psu` | `"error"`, `"adjust"`, `"certainty"` | Single-PSU handling |
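
Of these, "ratio" is the only parameter not demonstrated elsewhere in this skill. A minimal sketch, assuming the denominator is passed via an `x` argument to estimate() — verify the argument name against your samplics version:

```python
from samplics.estimation import TaylorEstimator

# Hedged sketch: ratio of two totals, e.g. spending per dollar of income.
# 'medical_spending' and 'income' are illustrative column names.
ratio_est = TaylorEstimator("ratio")
result = ratio_est.estimate(
    y=df['medical_spending'].values,   # numerator
    x=df['income'].values,             # denominator (assumed argument name)
    samp_weight=df['weight'].values,
    stratum=df['stratum'].values,
    psu=df['cluster'].values,
    remove_nan=True
)
print(f"Ratio: {result.point_est:.3f} (SE {result.stderror:.3f})")
```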

Estimation Methods

| Method / Attribute | Purpose | Example |
|---|---|---|
| `estimate()` | Point estimate + variance | `estimator.estimate(y, samp_weight)` |
| `point_est` | Access point estimate | `result.point_est` |
| `stderror` | Access standard error | `result.stderror` |
| `lower_ci` | Lower confidence limit | `result.lower_ci` |
| `upper_ci` | Upper confidence limit | `result.upper_ci` |

Visualization Patterns

Error Bar Plots

```python
import matplotlib.pyplot as plt

def plot_estimates(results_df, x_col, y_col, se_col,
                   title="Survey Estimates", figsize=(10, 6)):
    """
    Create error bar plot for survey estimates.
    """
    fig, ax = plt.subplots(figsize=figsize)
    ax.errorbar(
        results_df[x_col],
        results_df[y_col],
        yerr=1.96 * results_df[se_col],  # 95% CI
        marker='o',
        linestyle='-',
        linewidth=2,
        capsize=5,
        capthick=2
    )
    ax.set_xlabel(x_col, fontsize=12)
    ax.set_ylabel(y_col, fontsize=12)
    ax.set_title(title, fontsize=14)
    ax.grid(True, alpha=0.3)

    # Rotate x-labels if many categories
    if len(results_df) > 5:
        plt.xticks(rotation=45, ha='right')

    plt.tight_layout()
    return fig

# Use it (trends stores the time variable in its 'Time' column)
fig = plot_estimates(
    results_df=trends,
    x_col='Time',
    y_col='Estimate',
    se_col='SE',
    title='Health Indicator Over Time'
)
plt.savefig('estimates_plot.png', dpi=300, bbox_inches='tight')
```

Grouped Bar Plots

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_grouped_estimates(results_df, group_col, category_col,
                           estimate_col, se_col):
    """
    Create grouped bar plot with error bars.
    """
    categories = results_df[category_col].unique()
    groups = results_df[group_col].unique()

    x = np.arange(len(categories))
    width = 0.8 / len(groups)

    fig, ax = plt.subplots(figsize=(12, 6))

    for i, group in enumerate(groups):
        subset = results_df[results_df[group_col] == group]
        ax.bar(
            x + i * width,
            subset[estimate_col],
            width,
            yerr=1.96 * subset[se_col],
            label=group,
            capsize=5
        )

    ax.set_xlabel(category_col, fontsize=12)
    ax.set_ylabel(estimate_col, fontsize=12)
    ax.set_xticks(x + width * (len(groups) - 1) / 2)
    ax.set_xticklabels(categories, rotation=45, ha='right')
    ax.legend()
    ax.grid(True, alpha=0.3, axis='y')

    plt.tight_layout()
    return fig
```

Integration with Statsmodels

```python
import numpy as np
import statsmodels.api as sm
from samplics.estimation import TaylorEstimator

# 1. Fit weighted regression
model = sm.formula.wls(
    'outcome ~ age + gender + education',
    data=df,
    weights=df['weight']
).fit(cov_type='cluster', cov_kwds={'groups': df['cluster']})
print(model.summary())

# 2. Predicted values accounting for weights
df['predicted'] = model.predict(df)

# 3. Compute weighted mean of predictions
pred_estimator = TaylorEstimator("mean")
pred_result = pred_estimator.estimate(
    y=df['predicted'].values,
    samp_weight=df['weight'].values,
    stratum=df['stratum'].values,
    psu=df['cluster'].values
)
print(f"Mean predicted outcome: {pred_result.point_est:.2f}")

# 4. Model diagnostics with weights
weighted_residuals = model.resid * np.sqrt(df['weight'])
```

Advanced Topics

See the detailed guides in references/: descriptive-statistics.md, tabulation-guide.md, regression-integration.md, and design-effects.md.

Troubleshooting

Issue: "ValueError: could not broadcast input array"

Cause: Passing pandas Series instead of numpy arrays
Fix: Use .values when passing to estimator

```python
y = df['variable'].values  # Add .values
```

Issue: Large standard errors

Cause: Small sample size or high design effect
Fix:

  1. Check the sample size: print(len(y[~np.isnan(y)]))
  2. Compute the design effect to understand the variance inflation (a sketch follows this list)
  3. Consider collapsing categories or pooling data
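
A minimal sketch of step 2 for a mean, comparing the design-based variance to an approximate simple-random-sampling variance (the unweighted s²/n here is an approximation):

```python
import numpy as np

# DEFF = design-based variance / SRS variance.
# `y` and `result` are the array and TaylorEstimator result from above.
y_clean = y[~np.isnan(y)]
n = len(y_clean)
srs_var = np.var(y_clean, ddof=1) / n   # approximate SRS variance of the mean
deff = result.stderror**2 / srs_var     # DEFF >> 1 explains large SEs
print(f"n = {n}, DEFF = {deff:.2f}")
```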

Issue: "All values are NaN"

Cause: Missing data in variable
Fix:

```python
# Check missing data
print(df['variable'].isna().sum())

# Use remove_nan=True
result = estimator.estimate(..., remove_nan=True)
```

Issue: "Single PSU in stratum"

Cause: Only one cluster in a stratum
Fix: Use single_psu adjustment

```python
result = estimator.estimate(
    ...,
    single_psu="adjust"  # one of the options in the parameters table above
)
```

Function Reference

Main Estimators

```python
from samplics.estimation import TaylorEstimator

# Create estimator
est = TaylorEstimator(parameter)
# parameter: "mean", "proportion", "total", "ratio", "quantile"

# Estimate
result = est.estimate(
    y,                  # Outcome variable (array)
    samp_weight,        # Sampling weights (array)
    stratum=None,       # Stratum IDs (array, optional)
    psu=None,           # Cluster IDs (array, optional)
    ssu=None,           # Secondary sampling units (array, optional)
    remove_nan=True,    # Remove missing values
    alpha=0.05,         # Significance level
    single_psu=None     # Single PSU handling
)

# Access results
result.point_est   # Point estimate
result.stderror    # Standard error
result.lower_ci    # Lower CI bound
result.upper_ci    # Upper CI bound
```

Additional Functions

```python
# Replicate weights (for alternative variance estimation)
from samplics.estimation import ReplicateEstimator

rep_est = ReplicateEstimator(
    parameter="mean",
    method="bootstrap"  # or "jackknife", "BRR"
)
```

Resources

  • Survey sampling theory: Lohr (2019) "Sampling: Design and Analysis"
  • Complex surveys in practice: Lumley (2010) "Complex Surveys"
  • Scripts: Pre-built workflows in scripts/
  • Examples: Complete analyses in assets/

Installation Check

Run this to verify setup:

```python
import samplics
import pandas as pd
import numpy as np

print(f"samplics version: {samplics.__version__}")
print(f"pandas version: {pd.__version__}")
print("Setup complete!")
```

Getting Help

  1. Check function docstrings: help(TaylorEstimator)
  2. Review examples: assets/nhanes_example.py
  3. Consult references: references/descriptive-statistics.md
  4. Search docs: https://samplics-org.github.io/samplics/
