
survey-data-explorer

Intelligent variable selection and recoding for wide survey data (1000+ columns). Analyzes labeled data from SPSS/Stata/SAS, reduces to 100-200 core variables, detects question hierarchies (main vs follow-ups), recommends binary recoding for interpretability, and generates comprehensive data dictionaries. Perfect for exploratory analysis of complex survey datasets with extensive metadata.

Survey Data Explorer

Intelligent variable selection and recoding strategy for wide survey datasets with labeled data.

šŸš€ How to Use This Skill

This skill guides you through survey data exploration in four phases:

  1. Phase A: Data Loading & Profiling - Load labeled data, generate comprehensive descriptions
  2. Phase B: Intelligent Variable Selection - Reduce 1000+ columns to 100-200 core variables
  3. Phase C: Recoding Strategy - Binary-first recoding recommendations for interpretability
  4. Phase D: Export & Documentation - Create codebooks, cleaning plans, and documentation

Usage Patterns:

  • Complete Workflow: Follow phases A→D for new wide survey data
  • Quick Exploration: Phase A only for initial understanding
  • Variable Selection: Phases A→B to identify core variables
  • Recoding Guidance: Phases A→C for cleaning strategy

The Wide Survey Data Challenge

Problem: Survey datasets often have 1000+ columns including:

  • Multiple versions of the same question (detailed follow-ups, intensity scales)
  • Questions with very high missing rates (>50%)
  • Administrative variables not needed for analysis
  • Multiple recodes of the same underlying variable

Solution: This skill provides intelligent variable selection that:

  1. Keeps core variables: Time, geography, weights, demographics
  2. Selects high-level outcomes: Main questions only, not detailed follow-ups
  3. Filters by missing data: Drops variables with excessive missing values
  4. Identifies question patterns: "Do you support X?" (keep) vs "How strongly?" (secondary)
  5. Recommends binary recoding: For interpretable group-by analysis (% Agree, % Favor)

Typical Reduction: 1,000 columns → 100-200 core analysis variables


Prerequisites

Required packages:

```bash
pip install pandas numpy pyreadstat openpyxl tabulate
# On an externally managed Python environment (PEP 668), append --break-system-packages
```

Python version: 3.8+

Package Dependencies:

  • pandas (>=1.3.0): Core data manipulation
  • numpy (>=1.21.0): Numerical operations
  • pyreadstat (>=1.1.0): Read SPSS/Stata/SAS with labels
  • openpyxl (>=3.0.0): Excel export
  • tabulate (>=0.8.0): Markdown table formatting

Bundled Resources

References (references/)

Load these when implementing survey data analysis or needing detailed guidance:

  • variable-selection-guide.md - Heuristics for identifying main questions vs follow-ups
  • recoding-patterns.md - Common recoding strategies and interpretations
  • troubleshooting.md - Common issues with pyreadstat, encoding, metadata

Scripts (scripts/)

Executable utilities for survey data exploration:

  • survey_data_utils.py - Core functions (describe_survey_data, print_survey_summary, export_descriptions)
  • variable_selection.py - Intelligent variable selection (select_core_variables, identify_question_type)
  • recoding_helpers.py - Recoding recommendations (recommend_recoding and suggested binary definitions)

Assets (assets/)

Production-ready examples and templates:

  • examples/GSS_example.py - Complete workflow with General Social Survey data
  • templates/cleaning_plan.xlsx - Template for manual recoding review

šŸ“Š PHASE A: DATA LOADING & PROFILING

When to use Phase A:

  • Starting with new survey data file (SPSS, Stata, SAS)
  • Need to understand what variables exist
  • Want comprehensive variable descriptions with metadata
  • Initial exploration before any analysis

Phase A Workflow

  1. Load Data with Metadata

    • Use pyreadstat to preserve variable labels and value labels
    • Verify metadata loaded correctly
  2. Generate Comprehensive Descriptions

    • Analyze all variables systematically
    • Detect categorical vs continuous
    • Identify missing data patterns
    • Flag potential missing codes
  3. Create Initial Summary

    • Categorize variables by type
    • Count demographics, outcomes, design variables
    • Identify high-missing variables

Quick Start: Load and Describe

```python
import pandas as pd
import pyreadstat
from scripts.survey_data_utils import describe_survey_data, print_survey_summary

# Load data with labels (CRITICAL: use pyreadstat, not pd.read_spss)
df, meta = pyreadstat.read_sav('survey.sav')
# Or for Stata: df, meta = pyreadstat.read_dta('survey.dta')

print(f"Loaded {len(df):,} observations, {len(df.columns):,} variables")

# Generate comprehensive descriptions
desc = describe_survey_data(df, meta, max_unique_display=20, sample_n=5)

# Print human-readable summary
print_survey_summary(desc, output_file='data_summary.md')

# Export descriptions
desc.to_csv('variable_descriptions.csv', index=False)
```

describe_survey_data() Output

Columns returned:

  • position: Variable position in dataset (1-indexed)
  • variable: Variable name
  • label: Variable label (question text from metadata)
  • dtype: Data type
  • n_obs: Total observations
  • n_missing: Missing count
  • pct_missing: Percentage missing
  • n_valid: Valid observations
  • likely_categorical: 'Yes', 'No', or 'Maybe'
  • n_unique: Number of unique values
  • unique_values: Dict of value frequencies (for categorical)
  • value_labels: Original value labels from metadata
  • min, max, mean, median, std: Numeric statistics (for continuous)
  • potential_missing_codes: Suspected missing value codes (-9, 97-99, 998-999)
  • sample_values: Sample of actual values
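
For instance, a minimal sketch (assuming the desc DataFrame from the Quick Start above, with pct_missing on a 0-100 scale) that surfaces variables likely to be excluded later:

```python
# Assumes `desc` is the DataFrame returned by describe_survey_data() above,
# and that pct_missing is reported on a 0-100 scale.
high_missing = (
    desc[desc['pct_missing'] > 30]
    .sort_values('pct_missing', ascending=False)
    [['variable', 'label', 'pct_missing', 'potential_missing_codes']]
)
print(high_missing.head(20))

# Variables where suspected missing codes (e.g., -9, 98, 99) were flagged
flagged = desc[desc['potential_missing_codes'].notna()]
print(f"{len(flagged)} variables have suspected missing-value codes")
```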

Critical Phase A Checks

āœ“ Verify before proceeding:

  • Metadata loaded correctly (meta object is not None)
  • Variable labels present (desc['label'] not all empty)
  • Value labels present for categorical variables
  • Missing data patterns identified
  • Variable counts make sense
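
A quick sanity-check sketch along these lines (assuming df, meta, and desc from the Quick Start above):

```python
# Verify metadata and labels loaded before moving to Phase B
assert meta is not None, "Metadata missing - load with pyreadstat, not pandas"

n_labeled = desc['label'].fillna('').astype(str).str.strip().ne('').sum()
print(f"Variables with labels: {n_labeled} / {len(desc)}")

n_value_labels = desc['value_labels'].notna().sum()
print(f"Variables with value labels: {n_value_labels}")

print(f"Median missing rate: {desc['pct_missing'].median():.1f}%")
```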

🚨 Common issues:

  • Metadata not loading → Use pyreadstat, not pandas
  • Encoding errors → Specify encoding in read_sav/read_dta
  • Missing codes not flagged → Check potential_missing_codes column

šŸ” PHASE B: INTELLIGENT VARIABLE SELECTION

When to use Phase B:

  • After Phase A (descriptions generated)
  • Working with wide data (500+ columns)
  • Need to reduce to core analysis variables
  • Want to identify main questions vs follow-ups

Phase B Workflow

  1. Automated Filtering

    • Always keep: Time, geography, weights, design variables
    • Filter demographics by missing threshold
    • Filter outcomes by missing threshold and category counts
  2. Question Hierarchy Detection

    • Identify main questions (primary attitudes/behaviors)
    • Flag follow-up questions (intensity, details, conditional)
    • Mark as core vs secondary
  3. Generate Selection Report

    • Core variables (keep these)
    • Secondary variables (optional, detailed analysis)
    • Excluded variables (high missing, non-analyzable)

Variable Selection Pattern

```python
from scripts.variable_selection import select_core_variables, print_selection_summary

# Intelligent selection (1000+ → ~200 core variables)
selection = select_core_variables(
    desc,
    missing_threshold_outcomes=0.30,      # <30% missing for outcomes
    missing_threshold_demographics=0.50,  # <50% missing for demographics
    keep_follow_ups=False,                # Mark follow-ups as secondary
    min_categories=2,                     # Min unique values
    max_categories=20                     # Max unique values
)

# Print summary
print_selection_summary(selection)

# Access results
core_vars = selection['core']            # DataFrame of core variables
secondary_vars = selection['secondary']  # DataFrame of follow-up questions
excluded_vars = selection['exclude']     # DataFrame of excluded variables
summary_stats = selection['summary']     # Summary statistics
```

Selection Logic

Always KEEP (regardless of missing data):

  • Time variables: year, wave, date, time, cohort
  • Geography: state, region, county, FIPS, geo, country
  • Weights: weight, wgt, wt, wtssall
  • Design: stratum, strata, PSU, cluster, varunit

Keep if missing < threshold:

  • Demographics: age, sex, gender, education, income, race, ethnicity, religion, marital status, party ID, ideology

Outcomes (filter by missing + categories):

  • Must be categorical (likely_categorical = 'Yes')
  • Must have 2-20 unique values
  • Must have <30% missing (default threshold)
  • Main questions prioritized over follow-ups
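
A simplified sketch of the keyword filter described above (the bundled select_core_variables() implements the full logic; the regex here is illustrative only):

```python
import re

ALWAYS_KEEP = re.compile(
    r'year|wave|date|time|cohort|'              # time
    r'state|region|county|fips|geo|country|'    # geography
    r'weight|wgt|^wt|wtssall|'                  # weights
    r'stratum|strata|psu|cluster|varunit',      # design
    flags=re.IGNORECASE,
)

def is_always_keep(varname: str) -> bool:
    """Return True if the variable name matches a time/geo/weight/design keyword."""
    return bool(ALWAYS_KEEP.search(varname))

print(is_always_keep('WTSSALL'))   # True  (GSS survey weight)
print(is_always_keep('NATENVIR'))  # False (substantive outcome question)
```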

Question Type Detection

Main Questions (KEEP):

  • Simple, direct phrasing
  • Shorter variable names
  • Base questions: "Do you approve...?", "Do you support...?", "Have you ever...?"
  • No conditional language

Follow-Up Questions (SECONDARY):

  • Intensity/strength: "How strongly...?", "How much...?"
  • Conditional: "If yes, then..."
  • Detailed breakdowns
  • Longer variable names
  • Suffixes: _STR, _INT, _DETAIL, _A, _B, _FOLLOWUP
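
A simplified heuristic sketch in the spirit of identify_question_type() (the bundled function may use different rules; the suffixes and phrases below are taken from the lists above, and the example variable names are illustrative):

```python
import re

FOLLOW_UP_SUFFIXES = ('_STR', '_INT', '_DETAIL', '_A', '_B', '_FOLLOWUP')
FOLLOW_UP_PHRASES = re.compile(r'how strongly|how much|if yes', re.IGNORECASE)

def classify_question(varname: str, label: str) -> str:
    """Rough main-vs-follow-up classification from name suffix and label wording."""
    if varname.upper().endswith(FOLLOW_UP_SUFFIXES):
        return 'follow_up'
    if label and FOLLOW_UP_PHRASES.search(label):
        return 'follow_up'
    return 'main'

print(classify_question('GUNLAW', 'Do you favor or oppose gun permits?'))       # main
print(classify_question('GUNLAW_STR', 'How strongly do you feel about this?'))  # follow_up
```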

Selection Results

Typical output:

```
======================================================================
VARIABLE SELECTION SUMMARY
======================================================================

šŸ“Š Overall
   Total variables:            1,056
   Core selected:                187
   Secondary (follow-ups):       142
   Excluded:                     727
   Reduction:                  68.8%

āœ… Core Variables Breakdown
   Time:                           3
   Geography:                      5
   Weights:                        4
   Demographics:                  28
   Main outcome questions:       147

āŒ Exclusions
   High missing data (>30%):     312
   Continuous/text variables:     89
   Too many categories (>20):     45
   Too few categories (<2):       18
   Follow-up questions:          142
   Administrative variables:     121
```

šŸ’” PHASE C: RECODING STRATEGY

When to use Phase C:

  • After Phase B (core variables selected)
  • Need guidance on how to recode variables
  • Want binary versions for interpretability
  • Planning data cleaning pipeline

Phase C Workflow

  1. Analyze Variable Structure

    • Review number of categories
    • Parse value labels
    • Identify ordinal vs nominal
  2. Generate Binary Recommendations

    • Suggest how to create 0/1 versions
    • Prioritize interpretability (% who agree)
    • Provide specific recoding rules
  3. Suggest Ordinal Ordering

    • Recommend factor level ordering
    • Maintain natural progression
  4. Create Recoding Plan

    • Prioritize high-impact recodings
    • Document interpretation guidance

Recoding Recommendations Pattern

```python
from scripts.recoding_helpers import recommend_recoding

# Generate recoding plan (focused on core variables)
recoding = recommend_recoding(
    desc,
    selection_result=selection,  # Focus on core vars only
    binary_first=True            # Prioritize binary recoding
)

# View high-priority recodings
high_priority = recoding[recoding['priority'] == 'high']
print(high_priority[['variable', 'label', 'binary_definition', 'interpretation']])

# Save recoding plan
recoding.to_csv('recoding_plan.csv', index=False)
recoding.to_excel('recoding_plan.xlsx', index=False)
```

Recoding Strategy by Category Count

2 categories (already binary):

  • Strategy: Use as-is, verify coding (0/1 or 1/2?)
  • Priority: Low
  • Example: Male/Female → already binary

3 categories:

  • Strategy: Create binary by collapsing extremes, exclude middle
  • Binary: 1 = Agree, 0 = Disagree (exclude Neutral)
  • Also keep: 3-level ordinal version
  • Priority: High
  • Interpretation: "% who agree (excluding neutrals)"

4-5 categories (Likert scales):

  • Strategy: Binary (top 2 vs bottom 2) + keep ordinal
  • Binary: 1 = Agree/Strongly Agree, 0 = Disagree/Strongly Disagree
  • Also keep: Original 4-5 level ordinal
  • Priority: High
  • Interpretation: "% who agree" + "mean agreement score"

6-10 categories:

  • Strategy: Collapse to 3-4, then binary
  • Example: Education (6 levels) → 3 levels (Low/Mid/High) → binary (College+)
  • Priority: Medium
  • Interpretation: "% with college degree"

11+ categories:

  • Strategy: Manual review needed
  • Priority: Low
  • Check if truly categorical or needs different approach
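
A minimal sketch of this decision table (category counts, strategies, and priorities are taken from the rules above):

```python
def recoding_strategy(n_categories: int) -> str:
    """Map a variable's category count to the recommended recoding strategy."""
    if n_categories == 2:
        return 'already binary - verify 0/1 coding (low priority)'
    if n_categories == 3:
        return 'binary: collapse extremes, exclude middle (high priority)'
    if 4 <= n_categories <= 5:
        return 'binary: top 2 vs bottom 2, keep ordinal (high priority)'
    if 6 <= n_categories <= 10:
        return 'collapse to 3-4 levels, then binary (medium priority)'
    return 'manual review needed (low priority)'

for n in (2, 3, 5, 7, 15):
    print(n, '->', recoding_strategy(n))
```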

Binary-First Philosophy

Why binary recoding?

  1. Interpretability: "45% agree" > "Mean = 3.2 on 5-point scale"
  2. Communication: Simple percentages facilitate understanding
  3. Group comparisons: "College gap: 62% vs 38%" is clearer than regression coefficients
  4. Visualization: Bar charts of % by group are intuitive
  5. Statistical power: Sometimes clearer signal than ordinal

Best practices:

  • Create BOTH binary and ordinal versions
  • Binary for initial exploration and communication
  • Ordinal for modeling and detailed analysis
  • Document recoding rules clearly
  • Preserve original variables
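
A hedged pandas sketch of keeping both versions for a hypothetical 5-point Likert item q_approve (1=Strongly Disagree ... 5=Strongly Agree); adapt the cut points to your own value labels:

```python
import numpy as np
import pandas as pd

# Hypothetical 5-point Likert responses; replace with your own column
likert = pd.Series([1, 2, 3, 4, 5, 4, 2, np.nan], name='q_approve')

# Binary version: 1 = Agree (4-5), 0 = Disagree (1-2), NaN = Neutral (3) or missing
binary = pd.Series(
    np.select([likert >= 4, likert <= 2], [1.0, 0.0], default=np.nan),
    name='q_approve_bin',
)

# Ordinal version: ordered categorical preserving the natural 1-5 progression
ordinal = pd.Categorical(likert, categories=[1, 2, 3, 4, 5], ordered=True)

print(f"% who agree (excluding neutrals): {binary.mean():.0%}")
```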

Common Recoding Patterns

Pattern 1: Standard 5-point Likert

```python
# Original: 1=Strongly Disagree ... 5=Strongly Agree
# Binary:   1 = Agree (4-5), 0 = Disagree (1-2), NA = Neutral (3)
# Interpretation: "% who agree"
```

Pattern 2: Support/Oppose (4-point)

```python
# Original: 1=Strongly Oppose, 2=Oppose, 3=Favor, 4=Strongly Favor
# Binary:   1 = Favor (3-4), 0 = Oppose (1-2)
# Interpretation: "% who favor"
```

Pattern 3: Frequency (5-point)

```python
# Original: 1=Never, 2=Rarely, 3=Sometimes, 4=Often, 5=Always
# Binary:   1 = Frequent (4-5), 0 = Infrequent (1-3)
# Interpretation: "% who engage frequently"
```

Pattern 4: Education (multi-category)

```python
# Original: 6 categories (< HS, HS, Some college, Associate, Bachelor, Graduate)
# 3-level:  Low (< HS, HS), Mid (Some college, Associate), High (Bachelor+)
# Binary:   1 = College degree (Bachelor+), 0 = No degree
# Interpretation: "% with college degree"
```
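
A pandas sketch of this education collapse; the 1-6 codes below are assumptions, so check your value labels before mapping:

```python
import pandas as pd

# Assumed coding: 1=<HS, 2=HS, 3=Some college, 4=Associate, 5=Bachelor, 6=Graduate
educ = pd.Series([1, 2, 3, 4, 5, 6, 2, 5], name='educ')

# 3-level version
three_level = educ.map({1: 'Low', 2: 'Low',
                        3: 'Mid', 4: 'Mid',
                        5: 'High', 6: 'High'}).astype('category')

# Binary version: 1 = Bachelor's degree or higher
college_degree = (educ >= 5).astype(int)

print(f"% with college degree: {college_degree.mean():.0%}")
```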

šŸ“„ PHASE D: EXPORT & DOCUMENTATION

When to use Phase D:

  • After Phase C (recoding strategy defined)
  • Need to share findings with team
  • Creating documentation for future analysis
  • Preparing data dictionary

Phase D Workflow

  1. Export Descriptions

    • CSV for programmatic access
    • Excel with formatting
    • Markdown for human reading
    • JSON for web applications
  2. Create Cleaning Plan

    • Template spreadsheet for manual review
    • Document each recoding decision
    • Track transformation rules
  3. Generate Codebook

    • Markdown documentation
    • Variable descriptions with metadata
    • Recoding rules
    • Analysis notes
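
A minimal codebook-generation sketch using tabulate (assuming the desc DataFrame from Phase A; the bundled export helpers produce richer output):

```python
from tabulate import tabulate

# Keep a few codebook-friendly columns from describe_survey_data()
codebook_cols = ['variable', 'label', 'dtype', 'pct_missing', 'n_unique']
table_md = tabulate(desc[codebook_cols], headers='keys',
                    tablefmt='github', showindex=False)

with open('codebook.md', 'w') as f:
    f.write('# Survey Codebook\n\n')
    f.write(table_md)

print("Wrote codebook.md")
```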

Export Pattern

```python
from scripts.survey_data_utils import (
    export_descriptions, create_cleaning_plan_template, print_survey_summary
)

# Export descriptions in multiple formats
files = export_descriptions(desc, base_filename='survey_2024')
# Creates:
#   survey_2024.csv
#   survey_2024.xlsx
#   survey_2024_table.md
#   survey_2024.json

# Create cleaning plan template (for core variables only)
template = create_cleaning_plan_template(
    selection['core'],
    output_file='cleaning_plan.xlsx'
)

# Print summary for documentation
print_survey_summary(desc, output_file='survey_summary.md')
```

Final Outputs

After completing all phases, you should have:

Selection files:

  • core_variables.csv - Selected ~100-200 variables (Phase B)
  • secondary_variables.csv - Follow-up questions (Phase B)
  • excluded_variables.csv - Filtered variables with reasons (Phase B)

Recoding files:

  • recoding_plan.csv - Binary definitions and strategies (Phase C)
  • recoding_plan.xlsx - Formatted for manual review (Phase C)

Documentation files:

  • survey_summary.md - Human-readable overview (Phase A)
  • variable_descriptions.csv - Full variable details (Phase A)
  • cleaning_plan.xlsx - Template for transformation tracking (Phase D)
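
A short sketch for writing the selection files listed above (assuming the selection dict returned by select_core_variables() in Phase B):

```python
# Write the three Phase B outputs to CSV
selection['core'].to_csv('core_variables.csv', index=False)
selection['secondary'].to_csv('secondary_variables.csv', index=False)
selection['exclude'].to_csv('excluded_variables.csv', index=False)

print(f"Core: {len(selection['core'])}, "
      f"Secondary: {len(selection['secondary'])}, "
      f"Excluded: {len(selection['exclude'])}")
```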

šŸŽÆ QUICK REFERENCE: PHASE SELECTOR

When to Use Each Phase

| User says... | Use Phase... |
|---|---|
| "Load this SPSS file" | Phase A |
| "What variables are in this survey?" | Phase A |
| "Describe this survey data" | Phase A |
| "Reduce 1000 columns to core variables" | Phase B |
| "Select main questions" | Phase B |
| "Filter out high missing variables" | Phase B |
| "How should I recode this?" | Phase C |
| "Create binary versions" | Phase C |
| "Generate recoding strategy" | Phase C |
| "Export variable descriptions" | Phase D |
| "Create data dictionary" | Phase D |
| "Make codebook" | Phase D |

Typical Phase Sequences

Complete Analysis (First Time):

A → B → C → D

Quick Exploration:

A only

Variable Selection Focus:

A → B

Recoding Guidance:

A → C

30-Second Quick Start

```python
import pyreadstat
from scripts.survey_data_utils import describe_survey_data
from scripts.variable_selection import select_core_variables
from scripts.recoding_helpers import recommend_recoding

# Load with labels
df, meta = pyreadstat.read_sav('survey.sav')

# Select core variables (1000+ → ~200)
desc = describe_survey_data(df, meta)
selection = select_core_variables(desc, missing_threshold_outcomes=0.30)

# Generate recoding plan
recoding = recommend_recoding(desc, selection, binary_first=True)

# Save results
selection['core'].to_csv('core_variables.csv')
recoding.to_csv('recoding_plan.csv')

print(f"āœ“ Selected {len(selection['core'])} from {len(df.columns)} variables")
```

Key Principles

Variable Selection

  • āœ… Always keep: Time, geography, weights, demographics
  • āœ… Main questions: Simple, direct attitude/behavior measures
  • āš ļø Secondary: Detailed follow-ups, intensity measures
  • āŒ Exclude: High missing (>30%), non-categorical, administrative

Recoding Strategy

  • šŸŽÆ Primary: Binary (0/1) for % interpretation
  • šŸ“Š Secondary: Ordinal with proper ordering
  • šŸ”„ Both: Keep both versions for flexibility

Interpretability

  • "45% agree" > "Mean = 3.2 on 5-point scale"
  • "College gap: 62% vs 38%" > "Ordinal regression coefficient: 0.83"
  • Simple percentages facilitate communication
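
For example, a hedged group-by sketch producing that kind of percentage comparison (column names are hypothetical: q_approve_bin is a 0/1 recode, college_degree a 0/1 demographic; design-based estimates would additionally apply the survey weight):

```python
import pandas as pd

# Hypothetical cleaned data with binary recodes
df_clean = pd.DataFrame({
    'college_degree': [1, 1, 1, 0, 0, 0, 1, 0],
    'q_approve_bin':  [1, 1, 0, 0, 1, 0, 1, 0],
})

# % who agree, by education group
pct_by_group = (df_clean.groupby('college_degree')['q_approve_bin']
                        .mean()
                        .mul(100)
                        .round(1))
print(pct_by_group)
# college_degree
# 0    25.0
# 1    75.0
```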

When to Use This Skill

Perfect for:

  • āœ… Wide survey data (500+ columns)
  • āœ… Multiple time points (consistent reduction needed)
  • āœ… Exploratory analysis (what variables exist?)
  • āœ… Communication focus (need simple summaries)
  • āœ… Group-by analysis (demographic differences)

Not needed for:

  • āŒ Small surveys (<100 columns, manually review)
  • āŒ Single-purpose surveys (all variables relevant)
  • āŒ Pre-cleaned data (already reduced)
  • āŒ Experimental data (different structure)

Function Reference

Phase A Functions

| Function | Purpose | Example |
|---|---|---|
| describe_survey_data() | Comprehensive variable descriptions | desc = describe_survey_data(df, meta) |
| print_survey_summary() | Human-readable markdown summary | print_survey_summary(desc, 'summary.md') |

Phase B Functions

| Function | Purpose | Example |
|---|---|---|
| select_core_variables() | Intelligent variable selection | selection = select_core_variables(desc) |
| print_selection_summary() | Print selection statistics | print_selection_summary(selection) |

Phase C Functions

| Function | Purpose | Example |
|---|---|---|
| recommend_recoding() | Generate recoding plan | recoding = recommend_recoding(desc, selection) |

Phase D Functions

| Function | Purpose | Example |
|---|---|---|
| export_descriptions() | Export in multiple formats | export_descriptions(desc, 'survey_2024') |
| create_cleaning_plan_template() | Create Excel template | create_cleaning_plan_template(desc) |

Troubleshooting

Issue: Metadata not loading

Cause: Using pandas instead of pyreadstat
Fix: Use pyreadstat.read_sav(), not pd.read_spss()

Issue: Encoding errors

Cause: Special characters in variable/value labels
Fix: Specify encoding: pyreadstat.read_sav('file.sav', encoding='latin1')

Issue: Too many/too few variables selected

Cause: Threshold too strict or too lenient
Fix: Adjust missing_threshold_outcomes (default 0.30)

Issue: Binary recoding doesn't make sense

Cause: Non-ordinal variable or unusual scale
Fix: Review manually and adjust in recoding_plan.csv


Resources

  • Scripts: Complete implementations in scripts/
  • Examples: Full workflows in assets/examples/
  • References: Detailed guides in references/

Getting Help

  1. Check function docstrings: help(describe_survey_data)
  2. Review examples: assets/examples/GSS_example.py
  3. Consult references: references/troubleshooting.md
  4. Verify package versions: pip list | grep -E "pyreadstat|pandas"
