survey-data-explorer
Intelligent variable selection and recoding for wide survey data (1000+ columns). Analyzes labeled data from SPSS/Stata/SAS, reduces to 100-200 core variables, detects question hierarchies (main vs follow-ups), recommends binary recoding for interpretability, and generates comprehensive data dictionaries. Perfect for exploratory analysis of complex survey datasets with extensive metadata.
Survey Data Explorer
Intelligent variable selection and recoding strategy for wide survey datasets with labeled data.
How to Use This Skill
This skill guides you through survey data exploration in four phases:
- Phase A: Data Loading & Profiling - Load labeled data, generate comprehensive descriptions
- Phase B: Intelligent Variable Selection - Reduce 1000+ columns to 100-200 core variables
- Phase C: Recoding Strategy - Binary-first recoding recommendations for interpretability
- Phase D: Export & Documentation - Create codebooks, cleaning plans, and documentation
Usage Patterns:
- Complete Workflow: Follow phases A→D for new wide survey data
- Quick Exploration: Phase A only for initial understanding
- Variable Selection: Phases A→B to identify core variables
- Recoding Guidance: Phases A→C for cleaning strategy
The Wide Survey Data Challenge
Problem: Survey datasets often have 1000+ columns including:
- Multiple versions of the same question (detailed follow-ups, intensity scales)
- Questions with very high missing rates (>50%)
- Administrative variables not needed for analysis
- Multiple recodes of the same underlying variable
Solution: This skill provides intelligent variable selection that:
- Keeps core variables: Time, geography, weights, demographics
- Selects high-level outcomes: Main questions only, not detailed follow-ups
- Filters by missing data: Drops variables with excessive missing values
- Identifies question patterns: "Do you support X?" (keep) vs "How strongly?" (secondary)
- Recommends binary recoding: For interpretable group-by analysis (% Agree, % Favor)
Typical Reduction: 1,000 columns → 100-200 core analysis variables
Prerequisites
Required packages:
Python version: 3.8+
Package Dependencies:
- `pandas` (>=1.3.0): Core data manipulation
- `numpy` (>=1.21.0): Numerical operations
- `pyreadstat` (>=1.1.0): Read SPSS/Stata/SAS with labels
- `openpyxl` (>=3.0.0): Excel export
- `tabulate` (>=0.8.0): Markdown table formatting
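The versions above can be verified from Python before starting; a minimal check using only the standard library:

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions from the dependency list above
REQUIRED = {"pandas": "1.3.0", "numpy": "1.21.0", "pyreadstat": "1.1.0",
            "openpyxl": "3.0.0", "tabulate": "0.8.0"}

for pkg, minimum in REQUIRED.items():
    try:
        print(f"{pkg}: {version(pkg)} installed (need >= {minimum})")
    except PackageNotFoundError:
        print(f"{pkg}: MISSING -- install with pip")
```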
Bundled Resources
References (references/)
Load these when implementing survey data analysis or needing detailed guidance:
- variable-selection-guide.md - Heuristics for identifying main questions vs follow-ups
- recoding-patterns.md - Common recoding strategies and interpretations
- troubleshooting.md - Common issues with pyreadstat, encoding, metadata
Scripts (scripts/)
Executable utilities for survey data exploration:
- survey_data_utils.py - Core functions (describe_survey_data, print_survey_summary, export_descriptions)
- variable_selection.py - Intelligent variable selection (select_core_variables, identify_question_type)
- recoding_helpers.py - Recoding recommendations (recommend_recoding, binary definition suggestions)
Assets (assets/)
Production-ready examples and templates:
- examples/GSS_example.py - Complete workflow with General Social Survey data
- templates/cleaning_plan.xlsx - Template for manual recoding review
PHASE A: DATA LOADING & PROFILING
When to use Phase A:
- Starting with new survey data file (SPSS, Stata, SAS)
- Need to understand what variables exist
- Want comprehensive variable descriptions with metadata
- Initial exploration before any analysis
Phase A Workflow
1. Load Data with Metadata
   - Use pyreadstat to preserve variable labels and value labels
   - Verify metadata loaded correctly
2. Generate Comprehensive Descriptions
   - Analyze all variables systematically
   - Detect categorical vs continuous
   - Identify missing data patterns
   - Flag potential missing codes
3. Create Initial Summary
   - Categorize variables by type
   - Count demographics, outcomes, design variables
   - Identify high-missing variables
Quick Start: Load and Describe
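A minimal sketch of the load-and-describe step. `pyreadstat.read_sav()` is the library's real API; the helper imports assume the bundled `scripts/` directory is on your path, and the filename is a placeholder:

```python
import pyreadstat

# Load with metadata preserved (variable labels + value labels).
# Use read_dta() for Stata and read_sas7bdat() for SAS files.
df, meta = pyreadstat.read_sav("survey_2024.sav")

# Quick sanity check that metadata came through
print(f"{len(meta.column_labels)} variable labels loaded")

# Bundled helpers from scripts/survey_data_utils.py
from survey_data_utils import describe_survey_data, print_survey_summary

desc = describe_survey_data(df, meta)            # one row per variable
print_survey_summary(desc, "survey_summary.md")  # human-readable Phase A summary
```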
describe_survey_data() Output
Columns returned:
- `position`: Variable position in dataset (1-indexed)
- `variable`: Variable name
- `label`: Variable label (question text from metadata)
- `dtype`: Data type
- `n_obs`: Total observations
- `n_missing`: Missing count
- `pct_missing`: Percentage missing
- `n_valid`: Valid observations
- `likely_categorical`: 'Yes', 'No', or 'Maybe'
- `n_unique`: Number of unique values
- `unique_values`: Dict of value frequencies (for categorical)
- `value_labels`: Original value labels from metadata
- `min`, `max`, `mean`, `median`, `std`: Numeric statistics (for continuous)
- `potential_missing_codes`: Suspected missing value codes (-9, 97-99, 998-999)
- `sample_values`: Sample of actual values
Critical Phase A Checks
Verify before proceeding:
- Metadata loaded correctly (`meta` object is not None)
- Variable labels present (`desc['label']` not all empty)
- Value labels present for categorical variables
- Missing data patterns identified
- Variable counts make sense

Common issues:
- Metadata not loading → use pyreadstat, not pandas
- Encoding errors → specify encoding in `read_sav`/`read_dta`
- Missing codes not flagged → check the `potential_missing_codes` column
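These checks can be scripted; a short sketch, assuming `desc` is a pandas DataFrame with the columns listed above:

```python
# meta from pyreadstat, desc from describe_survey_data()
assert meta is not None, "Metadata missing -- load with pyreadstat, not pandas"

# Share of variables that carry a non-empty label
pct_labeled = desc["label"].fillna("").str.len().gt(0).mean()
print(f"{pct_labeled:.0%} of variables have labels")

# Variables with suspected missing codes that need manual review
flagged = desc[desc["potential_missing_codes"].notna()]
print(f"{len(flagged)} variables have suspected missing codes to review")
```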
PHASE B: INTELLIGENT VARIABLE SELECTION
When to use Phase B:
- After Phase A (descriptions generated)
- Working with wide data (500+ columns)
- Need to reduce to core analysis variables
- Want to identify main questions vs follow-ups
Phase B Workflow
1. Automated Filtering
   - Always keep: Time, geography, weights, design variables
   - Filter demographics by missing threshold
   - Filter outcomes by missing threshold and category counts
2. Question Hierarchy Detection
   - Identify main questions (primary attitudes/behaviors)
   - Flag follow-up questions (intensity, details, conditional)
   - Mark as core vs secondary
3. Generate Selection Report
   - Core variables (keep these)
   - Secondary variables (optional, detailed analysis)
   - Excluded variables (high missing, non-analyzable)
Variable Selection Pattern
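A sketch of the Phase B call. The `missing_threshold_outcomes` keyword is documented under Troubleshooting below; the dict-of-DataFrames return shape is an assumption, not a documented API:

```python
# Helpers from scripts/variable_selection.py
from variable_selection import select_core_variables, print_selection_summary

# desc comes from Phase A's describe_survey_data()
selection = select_core_variables(desc, missing_threshold_outcomes=0.30)
print_selection_summary(selection)

# Persist the three selection files described under Final Outputs (Phase D);
# the "core"/"secondary"/"excluded" keys are illustrative
selection["core"].to_csv("core_variables.csv", index=False)
selection["secondary"].to_csv("secondary_variables.csv", index=False)
selection["excluded"].to_csv("excluded_variables.csv", index=False)
```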
Selection Logic
Always KEEP (regardless of missing data):
- Time variables: year, wave, date, time, cohort
- Geography: state, region, county, FIPS, geo, country
- Weights: weight, wgt, wt, wtssall
- Design: stratum, strata, PSU, cluster, varunit
Keep if missing < threshold:
- Demographics: age, sex, gender, education, income, race, ethnicity, religion, marital status, party ID, ideology
Outcomes (filter by missing + categories):
- Must be categorical (likely_categorical = 'Yes')
- Must have 2-20 unique values
- Must have <30% missing (default threshold)
- Main questions prioritized over follow-ups
Question Type Detection
Main Questions (KEEP):
- Simple, direct phrasing
- Shorter variable names
- Base questions: "Do you approve...?", "Do you support...?", "Have you ever...?"
- No conditional language
Follow-Up Questions (SECONDARY):
- Intensity/strength: "How strongly...?", "How much...?"
- Conditional: "If yes, then..."
- Detailed breakdowns
- Longer variable names
- Suffixes: _STR, _INT, _DETAIL, _A, _B, _FOLLOWUP
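The detection rules above reduce to simple name and label matching. A hypothetical sketch using the keyword lists from this section; the bundled `identify_question_type()` may use different, more refined rules:

```python
import re

# Simplified keyword lists from the selection logic above
ALWAYS_KEEP = re.compile(
    r"year|wave|date|cohort|state|region|county|fips|country|"
    r"weight|wgt|wtssall|stratum|strata|psu|cluster|varunit",
    re.IGNORECASE,
)
FOLLOWUP_SUFFIX = re.compile(r"_(STR|INT|DETAIL|A|B|FOLLOWUP)$", re.IGNORECASE)
FOLLOWUP_PHRASES = ("how strongly", "how much", "if yes")

def identify_question_type(name, label):
    """Classify a variable as 'design', 'followup', or 'main'."""
    if ALWAYS_KEEP.search(name):
        return "design"    # time/geography/weights/design: always keep
    if FOLLOWUP_SUFFIX.search(name) or any(p in label.lower() for p in FOLLOWUP_PHRASES):
        return "followup"  # secondary: intensity/conditional follow-ups
    return "main"

print(identify_question_type("GUNLAW", "Do you favor gun permits?"))      # main
print(identify_question_type("GUNLAW_STR", "How strongly do you feel?"))  # followup
```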
Selection Results
Typical output:
```
======================================================================
VARIABLE SELECTION SUMMARY
======================================================================

Overall
  Total variables:           1,056
  Core selected:               187
  Secondary (follow-ups):      142
  Excluded:                    727
  Reduction:                 68.8%

Core Variables Breakdown
  Time:                          3
  Geography:                     5
  Weights:                       4
  Demographics:                 28
  Main outcome questions:      147

Exclusions
  High missing data (>30%):    312
  Continuous/text variables:    89
  Too many categories (>20):    45
  Too few categories (<2):      18
  Follow-up questions:         142
  Administrative variables:    121
```
PHASE C: RECODING STRATEGY
When to use Phase C:
- After Phase B (core variables selected)
- Need guidance on how to recode variables
- Want binary versions for interpretability
- Planning data cleaning pipeline
Phase C Workflow
1. Analyze Variable Structure
   - Review number of categories
   - Parse value labels
   - Identify ordinal vs nominal
2. Generate Binary Recommendations
   - Suggest how to create 0/1 versions
   - Prioritize interpretability (% who agree)
   - Provide specific recoding rules
3. Suggest Ordinal Ordering
   - Recommend factor level ordering
   - Maintain natural progression
4. Create Recoding Plan
   - Prioritize high-impact recodings
   - Document interpretation guidance
Recoding Recommendations Pattern
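A sketch of the Phase C call, assuming the signature in the Function Reference and that the result is a DataFrame (its exact columns are up to the implementation):

```python
# Helper from scripts/recoding_helpers.py
from recoding_helpers import recommend_recoding

# desc from Phase A, selection from Phase B
recoding = recommend_recoding(desc, selection)

# Persist for manual review (recoding_plan.csv under Final Outputs)
recoding.to_csv("recoding_plan.csv", index=False)
print(recoding.head())
```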
Recoding Strategy by Category Count
2 categories (already binary):
- Strategy: Use as-is, verify coding (0/1 or 1/2?)
- Priority: Low
- Example: Male/Female → already binary
3 categories:
- Strategy: Create binary by collapsing extremes, exclude middle
- Binary: 1 = Agree, 0 = Disagree (exclude Neutral)
- Also keep: 3-level ordinal version
- Priority: High
- Interpretation: "% who agree (excluding neutrals)"
4-5 categories (Likert scales):
- Strategy: Binary (top 2 vs bottom 2) + keep ordinal
- Binary: 1 = Agree/Strongly Agree, 0 = Disagree/Strongly Disagree
- Also keep: Original 4-5 level ordinal
- Priority: High
- Interpretation: "% who agree" + "mean agreement score"
6-10 categories:
- Strategy: Collapse to 3-4, then binary
- Example: Education (6 levels) → 3 levels (Low/Mid/High) → binary (College+)
- Priority: Medium
- Interpretation: "% with college degree"
11+ categories:
- Strategy: Manual review needed
- Priority: Low
- Check if truly categorical or needs different approach
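The rules above reduce to a small lookup; a sketch:

```python
def strategy_for(n_categories):
    """Map a category count to (strategy, priority), per the rules above."""
    if n_categories < 2:
        return "exclude: not analyzable", "n/a"
    if n_categories == 2:
        return "use as-is; verify 0/1 vs 1/2 coding", "low"
    if n_categories == 3:
        return "binary (collapse extremes, exclude middle) + 3-level ordinal", "high"
    if n_categories <= 5:
        return "binary (top 2 vs bottom 2) + original ordinal", "high"
    if n_categories <= 10:
        return "collapse to 3-4 levels, then binary", "medium"
    return "manual review: may not be truly categorical", "low"

print(strategy_for(5))  # ('binary (top 2 vs bottom 2) + original ordinal', 'high')
```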
Binary-First Philosophy
Why binary recoding?
- Interpretability: "45% agree" > "Mean = 3.2 on 5-point scale"
- Communication: Simple percentages facilitate understanding
- Group comparisons: "College gap: 62% vs 38%" is clearer than regression coefficients
- Visualization: Bar charts of % by group are intuitive
- Statistical power: Sometimes clearer signal than ordinal
Best practices:
- Create BOTH binary and ordinal versions
- Binary for initial exploration and communication
- Ordinal for modeling and detailed analysis
- Document recoding rules clearly
- Preserve original variables
Common Recoding Patterns
Four patterns cover most survey items; each is sketched in the code below:
- Pattern 1: Standard 5-point Likert
- Pattern 2: Support/Oppose (4-point)
- Pattern 3: Frequency (5-point)
- Pattern 4: Education (multi-category)
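A hedged pandas sketch of the four patterns. `Q1`, `Q2`, `Q3`, `EDUC`, and all value codes are placeholders; check actual codes against `meta.variable_value_labels` before recoding:

```python
import numpy as np
import pandas as pd

# df is the survey DataFrame loaded in Phase A

# Pattern 1: 5-point Likert (assumed 1=Strongly Agree ... 5=Strongly Disagree)
# Binary = top 2 vs bottom 2; neutral (3) set to missing
df["q1_agree"] = df["Q1"].map({1: 1, 2: 1, 3: np.nan, 4: 0, 5: 0})

# Pattern 2: Support/Oppose, 4-point (assumed 1=Strongly Support ... 4=Strongly Oppose)
# No middle category, so nothing is dropped
df["q2_support"] = df["Q2"].map({1: 1, 2: 1, 3: 0, 4: 0})

# Pattern 3: Frequency, 5-point (assumed 1=Never ... 5=Very Often)
# Binary = Often/Very Often vs the rest, preserving missing values
df["q3_often"] = df["Q3"].ge(4).astype(float).where(df["Q3"].notna())

# Pattern 4: Education (assumed 6 levels) -> 3 levels -> binary (College+)
df["educ3"] = pd.cut(df["EDUC"], bins=[0, 2, 4, 6], labels=["Low", "Mid", "High"])
df["college"] = df["EDUC"].ge(5).astype(float).where(df["EDUC"].notna())
```

Note that each original variable is kept untouched and both binary and ordinal versions are added alongside it, per the best practices above.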
PHASE D: EXPORT & DOCUMENTATION
When to use Phase D:
- After Phase C (recoding strategy defined)
- Need to share findings with team
- Creating documentation for future analysis
- Preparing data dictionary
Phase D Workflow
1. Export Descriptions
   - CSV for programmatic access
   - Excel with formatting
   - Markdown for human reading
   - JSON for web applications
2. Create Cleaning Plan
   - Template spreadsheet for manual review
   - Document each recoding decision
   - Track transformation rules
3. Generate Codebook
   - Markdown documentation
   - Variable descriptions with metadata
   - Recoding rules
   - Analysis notes
Export Pattern
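A sketch of the Phase D calls. `export_descriptions` is listed under scripts/survey_data_utils.py; the home module of `create_cleaning_plan_template` is not stated in this document and is assumed here, as are the exact filenames produced:

```python
from survey_data_utils import export_descriptions, create_cleaning_plan_template

# Writes CSV / Excel / Markdown / JSON variants sharing the 'survey_2024' prefix
# (exact output filenames are up to the implementation)
export_descriptions(desc, "survey_2024")

# Excel template for manual recoding review (cleaning_plan.xlsx style)
create_cleaning_plan_template(desc)
```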
Final Outputs
After completing all phases, you should have:
Selection files:
- `core_variables.csv` - Selected ~100-200 variables (Phase B)
- `secondary_variables.csv` - Follow-up questions (Phase B)
- `excluded_variables.csv` - Filtered variables with reasons (Phase B)
Recoding files:
- `recoding_plan.csv` - Binary definitions and strategies (Phase C)
- `recoding_plan.xlsx` - Formatted for manual review (Phase C)
Documentation files:
- `survey_summary.md` - Human-readable overview (Phase A)
- `variable_descriptions.csv` - Full variable details (Phase A)
- `cleaning_plan.xlsx` - Template for transformation tracking (Phase D)
QUICK REFERENCE: PHASE SELECTOR
When to Use Each Phase
| User says... | Use Phase... |
|---|---|
| "Load this SPSS file" | Phase A |
| "What variables are in this survey?" | Phase A |
| "Describe this survey data" | Phase A |
| "Reduce 1000 columns to core variables" | Phase B |
| "Select main questions" | Phase B |
| "Filter out high missing variables" | Phase B |
| "How should I recode this?" | Phase C |
| "Create binary versions" | Phase C |
| "Generate recoding strategy" | Phase C |
| "Export variable descriptions" | Phase D |
| "Create data dictionary" | Phase D |
| "Make codebook" | Phase D |
Typical Phase Sequences
Complete Analysis (First Time): A → B → C → D
Quick Exploration: A only
Variable Selection Focus: A → B
Recoding Guidance: A → C
30-Second Quick Start
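The full A→D pipeline in one block; a minimal sketch, assuming the bundled `scripts/` helpers are on your path and a placeholder filename:

```python
import pyreadstat
from survey_data_utils import describe_survey_data, print_survey_summary, export_descriptions
from variable_selection import select_core_variables, print_selection_summary
from recoding_helpers import recommend_recoding

df, meta = pyreadstat.read_sav("survey_2024.sav")   # Phase A: load with labels
desc = describe_survey_data(df, meta)
print_survey_summary(desc, "survey_summary.md")

selection = select_core_variables(desc)             # Phase B: 1,000+ -> 100-200 core
print_selection_summary(selection)

recoding = recommend_recoding(desc, selection)      # Phase C: binary-first plan
recoding.to_csv("recoding_plan.csv", index=False)

export_descriptions(desc, "survey_2024")            # Phase D: documentation exports
```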
Key Principles
Variable Selection
- ✅ Always keep: Time, geography, weights, demographics
- ✅ Main questions: Simple, direct attitude/behavior measures
- ⚠️ Secondary: Detailed follow-ups, intensity measures
- ❌ Exclude: High missing (>30%), non-categorical, administrative
Recoding Strategy
- Primary: Binary (0/1) for % interpretation
- Secondary: Ordinal with proper ordering
- Both: Keep both versions for flexibility
Interpretability
- "45% agree" > "Mean = 3.2 on 5-point scale"
- "College gap: 62% vs 38%" > "Ordinal regression coefficient: 0.83"
- Simple percentages facilitate communication
When to Use This Skill
Perfect for:
- ✅ Wide survey data (500+ columns)
- ✅ Multiple time points (consistent reduction needed)
- ✅ Exploratory analysis (what variables exist?)
- ✅ Communication focus (need simple summaries)
- ✅ Group-by analysis (demographic differences)
Not needed for:
- ❌ Small surveys (<100 columns; review manually instead)
- ❌ Single-purpose surveys (all variables relevant)
- ❌ Pre-cleaned data (already reduced)
- ❌ Experimental data (different structure)
Function Reference
Phase A Functions
| Function | Purpose | Example |
|---|---|---|
| `describe_survey_data()` | Comprehensive variable descriptions | `desc = describe_survey_data(df, meta)` |
| `print_survey_summary()` | Human-readable markdown summary | `print_survey_summary(desc, 'summary.md')` |
Phase B Functions
| Function | Purpose | Example |
|---|---|---|
| `select_core_variables()` | Intelligent variable selection | `selection = select_core_variables(desc)` |
| `print_selection_summary()` | Print selection statistics | `print_selection_summary(selection)` |
Phase C Functions
| Function | Purpose | Example |
|---|---|---|
| `recommend_recoding()` | Generate recoding plan | `recoding = recommend_recoding(desc, selection)` |
Phase D Functions
| Function | Purpose | Example |
|---|---|---|
| `export_descriptions()` | Export in multiple formats | `export_descriptions(desc, 'survey_2024')` |
| `create_cleaning_plan_template()` | Create Excel template | `create_cleaning_plan_template(desc)` |
Troubleshooting
Issue: Metadata not loading
Cause: Using pandas instead of pyreadstat
Fix: Use `pyreadstat.read_sav()`, not `pd.read_spss()`
Issue: Encoding errors
Cause: Special characters in variable/value labels
Fix: Specify the encoding: `pyreadstat.read_sav('file.sav', encoding='latin1')`
Issue: Too many/too few variables selected
Cause: Threshold too strict/lenient
Fix: Adjust `missing_threshold_outcomes` (default 0.30)
Issue: Binary recoding doesn't make sense
Cause: Non-ordinal variable or unusual scale
Fix: Review manually and adjust in `recoding_plan.csv`
Resources
- pyreadstat docs: https://github.com/Roche/pyreadstat
- Scripts: Complete implementations in `scripts/`
- Examples: Full workflows in `assets/examples/`
- References: Detailed guides in `references/`
Getting Help
- Check function docstrings: `help(describe_survey_data)`
- Review examples: `assets/examples/GSS_example.py`
- Consult references: `references/troubleshooting.md`
- Verify package versions: `pip list | grep -E "pyreadstat|pandas"`