survey-data-explorer
Intelligent variable selection and recoding for wide survey data (1000+ columns). Analyzes labeled data from SPSS/Stata/SAS, reduces to 100-200 core variables, detects question hierarchies (main vs follow-ups), recommends binary recoding for interpretability, and generates comprehensive data dictionaries. Perfect for exploratory analysis of complex survey datasets with extensive metadata.
Survey Data Explorer
Intelligent variable selection and recoding strategy for wide survey datasets with labeled data.
How to Use This Skill
This skill guides you through survey data exploration in four phases:
- Phase A: Data Loading & Profiling - Load labeled data, generate comprehensive descriptions
- Phase B: Intelligent Variable Selection - Reduce 1000+ columns to 100-200 core variables
- Phase C: Recoding Strategy - Binary-first recoding recommendations for interpretability
- Phase D: Export & Documentation - Create codebooks, cleaning plans, and documentation
Usage Patterns:
- Complete Workflow: Follow phases A→D for new wide survey data
- Quick Exploration: Phase A only for initial understanding
- Variable Selection: Phases A→B to identify core variables
- Recoding Guidance: Phases A→C for cleaning strategy
The Wide Survey Data Challenge
Problem: Survey datasets often have 1000+ columns including:
- Multiple versions of the same question (detailed follow-ups, intensity scales)
- Questions with very high missing rates (>50%)
- Administrative variables not needed for analysis
- Multiple recodes of the same underlying variable
Solution: This skill provides intelligent variable selection that:
- Keeps core variables: Time, geography, weights, demographics
- Selects high-level outcomes: Main questions only, not detailed follow-ups
- Filters by missing data: Drops variables with excessive missing values
- Identifies question patterns: "Do you support X?" (keep) vs "How strongly?" (secondary)
- Recommends binary recoding: For interpretable group-by analysis (% Agree, % Favor)
Typical Reduction: 1,000 columns → 100-200 core analysis variables
Prerequisites
Required packages:
Python version: 3.8+
Package Dependencies:
- `pandas` (>=1.3.0): Core data manipulation
- `numpy` (>=1.21.0): Numerical operations
- `pyreadstat` (>=1.1.0): Read SPSS/Stata/SAS with labels
- `openpyxl` (>=3.0.0): Excel export
- `tabulate` (>=0.8.0): Markdown table formatting
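The versions above can be verified from Python before starting; a minimal check using only the standard library:

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions from the dependency list above
REQUIRED = {"pandas": "1.3.0", "numpy": "1.21.0", "pyreadstat": "1.1.0",
            "openpyxl": "3.0.0", "tabulate": "0.8.0"}

for pkg, minimum in REQUIRED.items():
    try:
        print(f"{pkg}: {version(pkg)} installed (need >= {minimum})")
    except PackageNotFoundError:
        print(f"{pkg}: MISSING -- install with pip")
```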
Bundled Resources
References (references/)
Load these when implementing survey data analysis or needing detailed guidance:
- variable-selection-guide.md - Heuristics for identifying main questions vs follow-ups
- recoding-patterns.md - Common recoding strategies and interpretations
- troubleshooting.md - Common issues with pyreadstat, encoding, metadata
Scripts (scripts/)
Executable utilities for survey data exploration:
- survey_data_utils.py - Core functions (describe_survey_data, print_survey_summary, export_descriptions)
- variable_selection.py - Intelligent variable selection (select_core_variables, identify_question_type)
- recoding_helpers.py - Recoding recommendations (recommend_recoding, binary definition suggestions)
Assets (assets/)
Production-ready examples and templates:
- examples/GSS_example.py - Complete workflow with General Social Survey data
- templates/cleaning_plan.xlsx - Template for manual recoding review
PHASE A: DATA LOADING & PROFILING
When to use Phase A:
- Starting with new survey data file (SPSS, Stata, SAS)
- Need to understand what variables exist
- Want comprehensive variable descriptions with metadata
- Initial exploration before any analysis
Phase A Workflow
1. Load Data with Metadata
   - Use pyreadstat to preserve variable labels and value labels
   - Verify metadata loaded correctly
2. Generate Comprehensive Descriptions
   - Analyze all variables systematically
   - Detect categorical vs continuous
   - Identify missing data patterns
   - Flag potential missing codes
3. Create Initial Summary
   - Categorize variables by type
   - Count demographics, outcomes, design variables
   - Identify high-missing variables
Quick Start: Load and Describe
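A minimal sketch of the load-and-describe step. `pyreadstat.read_sav()` is the library's real API; the helper imports assume the bundled `scripts/` directory is on your path, and the filename is a placeholder:

```python
import pyreadstat

# Load with metadata preserved (variable labels + value labels).
# Use read_dta() for Stata and read_sas7bdat() for SAS files.
df, meta = pyreadstat.read_sav("survey_2024.sav")

# Quick sanity check that metadata came through
print(f"{len(meta.column_labels)} variable labels loaded")

# Bundled helpers from scripts/survey_data_utils.py
from survey_data_utils import describe_survey_data, print_survey_summary

desc = describe_survey_data(df, meta)            # one row per variable
print_survey_summary(desc, "survey_summary.md")  # human-readable Phase A summary
```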
describe_survey_data() Output
Columns returned:
- `position`: Variable position in dataset (1-indexed)
- `variable`: Variable name
- `label`: Variable label (question text from metadata)
- `dtype`: Data type
- `n_obs`: Total observations
- `n_missing`: Missing count
- `pct_missing`: Percentage missing
- `n_valid`: Valid observations
- `likely_categorical`: 'Yes', 'No', or 'Maybe'
- `n_unique`: Number of unique values
- `unique_values`: Dict of value frequencies (for categorical)
- `value_labels`: Original value labels from metadata
- `min`, `max`, `mean`, `median`, `std`: Numeric statistics (for continuous)
- `potential_missing_codes`: Suspected missing value codes (-9, 97-99, 998-999)
- `sample_values`: Sample of actual values
Critical Phase A Checks
Verify before proceeding:
- Metadata loaded correctly (`meta` object is not None)
- Variable labels present (`desc['label']` not all empty)
- Value labels present for categorical variables
- Missing data patterns identified
- Variable counts make sense

Common issues:
- Metadata not loading → use pyreadstat, not pandas
- Encoding errors → specify encoding in `read_sav`/`read_dta`
- Missing codes not flagged → check the `potential_missing_codes` column
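These checks can be scripted; a short sketch, assuming `desc` is a pandas DataFrame with the columns listed above:

```python
# meta from pyreadstat, desc from describe_survey_data()
assert meta is not None, "Metadata missing -- load with pyreadstat, not pandas"

# Share of variables that carry a non-empty label
pct_labeled = desc["label"].fillna("").str.len().gt(0).mean()
print(f"{pct_labeled:.0%} of variables have labels")

# Variables with suspected missing codes that need manual review
flagged = desc[desc["potential_missing_codes"].notna()]
print(f"{len(flagged)} variables have suspected missing codes to review")
```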
PHASE B: INTELLIGENT VARIABLE SELECTION
When to use Phase B:
- After Phase A (descriptions generated)
- Working with wide data (500+ columns)
- Need to reduce to core analysis variables
- Want to identify main questions vs follow-ups
Phase B Workflow
1. Automated Filtering
   - Always keep: Time, geography, weights, design variables
   - Filter demographics by missing threshold
   - Filter outcomes by missing threshold and category counts
2. Question Hierarchy Detection
   - Identify main questions (primary attitudes/behaviors)
   - Flag follow-up questions (intensity, details, conditional)
   - Mark as core vs secondary
3. Generate Selection Report
   - Core variables (keep these)
   - Secondary variables (optional, detailed analysis)
   - Excluded variables (high missing, non-analyzable)
Variable Selection Pattern
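A sketch of the Phase B call. The `missing_threshold_outcomes` keyword is documented under Troubleshooting below; the dict-of-DataFrames return shape is an assumption, not a documented API:

```python
# Helpers from scripts/variable_selection.py
from variable_selection import select_core_variables, print_selection_summary

# desc comes from Phase A's describe_survey_data()
selection = select_core_variables(desc, missing_threshold_outcomes=0.30)
print_selection_summary(selection)

# Persist the three selection files described under Final Outputs (Phase D);
# the "core"/"secondary"/"excluded" keys are illustrative
selection["core"].to_csv("core_variables.csv", index=False)
selection["secondary"].to_csv("secondary_variables.csv", index=False)
selection["excluded"].to_csv("excluded_variables.csv", index=False)
```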
Selection Logic
Always KEEP (regardless of missing data):
- Time variables: year, wave, date, time, cohort
- Geography: state, region, county, FIPS, geo, country
- Weights: weight, wgt, wt, wtssall
- Design: stratum, strata, PSU, cluster, varunit
Keep if missing < threshold:
- Demographics: age, sex, gender, education, income, race, ethnicity, religion, marital status, party ID, ideology
Outcomes (filter by missing + categories):
- Must be categorical (likely_categorical = 'Yes')
- Must have 2-20 unique values
- Must have <30% missing (default threshold)
- Main questions prioritized over follow-ups
Question Type Detection
Main Questions (KEEP):
- Simple, direct phrasing
- Shorter variable names
- Base questions: "Do you approve...?", "Do you support...?", "Have you ever...?"
- No conditional language
Follow-Up Questions (SECONDARY):
- Intensity/strength: "How strongly...?", "How much...?"
- Conditional: "If yes, then..."
- Detailed breakdowns
- Longer variable names
- Suffixes: _STR, _INT, _DETAIL, _A, _B, _FOLLOWUP
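The detection rules above reduce to simple name and label matching. A hypothetical sketch using the keyword lists from this section; the bundled `identify_question_type()` may use different, more refined rules:

```python
import re

# Simplified keyword lists from the selection logic above
ALWAYS_KEEP = re.compile(
    r"year|wave|date|cohort|state|region|county|fips|country|"
    r"weight|wgt|wtssall|stratum|strata|psu|cluster|varunit",
    re.IGNORECASE,
)
FOLLOWUP_SUFFIX = re.compile(r"_(STR|INT|DETAIL|A|B|FOLLOWUP)$", re.IGNORECASE)
FOLLOWUP_PHRASES = ("how strongly", "how much", "if yes")

def identify_question_type(name, label):
    """Classify a variable as 'design', 'followup', or 'main'."""
    if ALWAYS_KEEP.search(name):
        return "design"    # time/geography/weights/design: always keep
    if FOLLOWUP_SUFFIX.search(name) or any(p in label.lower() for p in FOLLOWUP_PHRASES):
        return "followup"  # secondary: intensity/conditional follow-ups
    return "main"

print(identify_question_type("GUNLAW", "Do you favor gun permits?"))      # main
print(identify_question_type("GUNLAW_STR", "How strongly do you feel?"))  # followup
```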
Selection Results
Typical output:
```
======================================================================
VARIABLE SELECTION SUMMARY
======================================================================

Overall
  Total variables:           1,056
  Core selected:               187
  Secondary (follow-ups):      142
  Excluded:                    727
  Reduction:                 68.8%

Core Variables Breakdown
  Time:                          3
  Geography:                     5
  Weights:                       4
  Demographics:                 28
  Main outcome questions:      147

Exclusions
  High missing data (>30%):    312
  Continuous/text variables:    89
  Too many categories (>20):    45
  Too few categories (<2):      18
  Follow-up questions:         142
  Administrative variables:    121
```
PHASE C: RECODING STRATEGY
When to use Phase C:
- After Phase B (core variables selected)
- Need guidance on how to recode variables
- Want binary versions for interpretability
- Planning data cleaning pipeline
Phase C Workflow
1. Analyze Variable Structure
   - Review number of categories
   - Parse value labels
   - Identify ordinal vs nominal
2. Generate Binary Recommendations
   - Suggest how to create 0/1 versions
   - Prioritize interpretability (% who agree)
   - Provide specific recoding rules
3. Suggest Ordinal Ordering
   - Recommend factor level ordering
   - Maintain natural progression
4. Create Recoding Plan
   - Prioritize high-impact recodings
   - Document interpretation guidance
Recoding Recommendations Pattern
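A sketch of the Phase C call, assuming the signature in the Function Reference and that the result is a DataFrame (its exact columns are up to the implementation):

```python
# Helper from scripts/recoding_helpers.py
from recoding_helpers import recommend_recoding

# desc from Phase A, selection from Phase B
recoding = recommend_recoding(desc, selection)

# Persist for manual review (recoding_plan.csv under Final Outputs)
recoding.to_csv("recoding_plan.csv", index=False)
print(recoding.head())
```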
Recoding Strategy by Category Count
2 categories (already binary):
- Strategy: Use as-is, verify coding (0/1 or 1/2?)
- Priority: Low
- Example: Male/Female → already binary
3 categories:
- Strategy: Create binary by collapsing extremes, exclude middle
- Binary: 1 = Agree, 0 = Disagree (exclude Neutral)
- Also keep: 3-level ordinal version
- Priority: High
- Interpretation: "% who agree (excluding neutrals)"
4-5 categories (Likert scales):
- Strategy: Binary (top 2 vs bottom 2) + keep ordinal
- Binary: 1 = Agree/Strongly Agree, 0 = Disagree/Strongly Disagree
- Also keep: Original 4-5 level ordinal
- Priority: High
- Interpretation: "% who agree" + "mean agreement score"
6-10 categories:
- Strategy: Collapse to 3-4, then binary
- Example: Education (6 levels) → 3 levels (Low/Mid/High) → binary (College+)
- Priority: Medium
- Interpretation: "% with college degree"
11+ categories:
- Strategy: Manual review needed
- Priority: Low
- Check if truly categorical or needs different approach
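The rules above reduce to a small lookup; a sketch:

```python
def strategy_for(n_categories):
    """Map a category count to (strategy, priority), per the rules above."""
    if n_categories < 2:
        return "exclude: not analyzable", "n/a"
    if n_categories == 2:
        return "use as-is; verify 0/1 vs 1/2 coding", "low"
    if n_categories == 3:
        return "binary (collapse extremes, exclude middle) + 3-level ordinal", "high"
    if n_categories <= 5:
        return "binary (top 2 vs bottom 2) + original ordinal", "high"
    if n_categories <= 10:
        return "collapse to 3-4 levels, then binary", "medium"
    return "manual review: may not be truly categorical", "low"

print(strategy_for(5))  # ('binary (top 2 vs bottom 2) + original ordinal', 'high')
```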
Binary-First Philosophy
Why binary recoding?
- Interpretability: "45% agree" > "Mean = 3.2 on 5-point scale"
- Communication: Simple percentages facilitate understanding
- Group comparisons: "College gap: 62% vs 38%" is clearer than regression coefficients
- Visualization: Bar charts of % by group are intuitive
- Statistical power: Sometimes clearer signal than ordinal
Best practices:
- Create BOTH binary and ordinal versions
- Binary for initial exploration and communication
- Ordinal for modeling and detailed analysis
- Document recoding rules clearly
- Preserve original variables
Common Recoding Patterns
Four patterns cover most survey items; each is sketched in the code below:
- Pattern 1: Standard 5-point Likert
- Pattern 2: Support/Oppose (4-point)
- Pattern 3: Frequency (5-point)
- Pattern 4: Education (multi-category)
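A hedged pandas sketch of the four patterns. `Q1`, `Q2`, `Q3`, `EDUC`, and all value codes are placeholders; check actual codes against `meta.variable_value_labels` before recoding:

```python
import numpy as np
import pandas as pd

# df is the survey DataFrame loaded in Phase A

# Pattern 1: 5-point Likert (assumed 1=Strongly Agree ... 5=Strongly Disagree)
# Binary = top 2 vs bottom 2; neutral (3) set to missing
df["q1_agree"] = df["Q1"].map({1: 1, 2: 1, 3: np.nan, 4: 0, 5: 0})

# Pattern 2: Support/Oppose, 4-point (assumed 1=Strongly Support ... 4=Strongly Oppose)
# No middle category, so nothing is dropped
df["q2_support"] = df["Q2"].map({1: 1, 2: 1, 3: 0, 4: 0})

# Pattern 3: Frequency, 5-point (assumed 1=Never ... 5=Very Often)
# Binary = Often/Very Often vs the rest, preserving missing values
df["q3_often"] = df["Q3"].ge(4).astype(float).where(df["Q3"].notna())

# Pattern 4: Education (assumed 6 levels) -> 3 levels -> binary (College+)
df["educ3"] = pd.cut(df["EDUC"], bins=[0, 2, 4, 6], labels=["Low", "Mid", "High"])
df["college"] = df["EDUC"].ge(5).astype(float).where(df["EDUC"].notna())
```

Note that each original variable is kept untouched and both binary and ordinal versions are added alongside it, per the best practices above.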
PHASE D: EXPORT & DOCUMENTATION
When to use Phase D:
- After Phase C (recoding strategy defined)
- Need to share findings with team
- Creating documentation for future analysis
- Preparing data dictionary
Phase D Workflow
1. Export Descriptions
   - CSV for programmatic access
   - Excel with formatting
   - Markdown for human reading
   - JSON for web applications
2. Create Cleaning Plan
   - Template spreadsheet for manual review
   - Document each recoding decision
   - Track transformation rules
3. Generate Codebook
   - Markdown documentation
   - Variable descriptions with metadata
   - Recoding rules
   - Analysis notes
Export Pattern
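A sketch of the Phase D calls. `export_descriptions` is listed under scripts/survey_data_utils.py; the home module of `create_cleaning_plan_template` is not stated in this document and is assumed here, as are the exact filenames produced:

```python
from survey_data_utils import export_descriptions, create_cleaning_plan_template

# Writes CSV / Excel / Markdown / JSON variants sharing the 'survey_2024' prefix
# (exact output filenames are up to the implementation)
export_descriptions(desc, "survey_2024")

# Excel template for manual recoding review (cleaning_plan.xlsx style)
create_cleaning_plan_template(desc)
```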
Final Outputs
After completing all phases, you should have:
Selection files:
- `core_variables.csv` - Selected ~100-200 variables (Phase B)
- `secondary_variables.csv` - Follow-up questions (Phase B)
- `excluded_variables.csv` - Filtered variables with reasons (Phase B)
Recoding files:
- `recoding_plan.csv` - Binary definitions and strategies (Phase C)
- `recoding_plan.xlsx` - Formatted for manual review (Phase C)
Documentation files:
- `survey_summary.md` - Human-readable overview (Phase A)
- `variable_descriptions.csv` - Full variable details (Phase A)
- `cleaning_plan.xlsx` - Template for transformation tracking (Phase D)
QUICK REFERENCE: PHASE SELECTOR
When to Use Each Phase
| User says... | Use Phase... |
|---|---|
| "Load this SPSS file" | Phase A |
| "What variables are in this survey?" | Phase A |
| "Describe this survey data" | Phase A |
| "Reduce 1000 columns to core variables" | Phase B |
| "Select main questions" | Phase B |
| "Filter out high missing variables" | Phase B |
| "How should I recode this?" | Phase C |
| "Create binary versions" | Phase C |
| "Generate recoding strategy" | Phase C |
| "Export variable descriptions" | Phase D |
| "Create data dictionary" | Phase D |
| "Make codebook" | Phase D |
Typical Phase Sequences
Complete Analysis (First Time): A → B → C → D
Quick Exploration: A only
Variable Selection Focus: A → B
Recoding Guidance: A → C
30-Second Quick Start
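The full A→D pipeline in one block; a minimal sketch, assuming the bundled `scripts/` helpers are on your path and a placeholder filename:

```python
import pyreadstat
from survey_data_utils import describe_survey_data, print_survey_summary, export_descriptions
from variable_selection import select_core_variables, print_selection_summary
from recoding_helpers import recommend_recoding

df, meta = pyreadstat.read_sav("survey_2024.sav")   # Phase A: load with labels
desc = describe_survey_data(df, meta)
print_survey_summary(desc, "survey_summary.md")

selection = select_core_variables(desc)             # Phase B: 1,000+ -> 100-200 core
print_selection_summary(selection)

recoding = recommend_recoding(desc, selection)      # Phase C: binary-first plan
recoding.to_csv("recoding_plan.csv", index=False)

export_descriptions(desc, "survey_2024")            # Phase D: documentation exports
```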
Key Principles
Variable Selection
- ✅ Always keep: Time, geography, weights, demographics
- ✅ Main questions: Simple, direct attitude/behavior measures
- ⚠️ Secondary: Detailed follow-ups, intensity measures
- ❌ Exclude: High missing (>30%), non-categorical, administrative
Recoding Strategy
- Primary: Binary (0/1) for % interpretation
- Secondary: Ordinal with proper ordering
- Both: Keep both versions for flexibility
Interpretability
- "45% agree" > "Mean = 3.2 on 5-point scale"
- "College gap: 62% vs 38%" > "Ordinal regression coefficient: 0.83"
- Simple percentages facilitate communication
When to Use This Skill
Perfect for:
- ✅ Wide survey data (500+ columns)
- ✅ Multiple time points (consistent reduction needed)
- ✅ Exploratory analysis (what variables exist?)
- ✅ Communication focus (need simple summaries)
- ✅ Group-by analysis (demographic differences)
Not needed for:
- ❌ Small surveys (<100 columns; review manually instead)
- ❌ Single-purpose surveys (all variables relevant)
- ❌ Pre-cleaned data (already reduced)
- ❌ Experimental data (different structure)
Function Reference
Phase A Functions
| Function | Purpose | Example |
|---|---|---|
| `describe_survey_data()` | Comprehensive variable descriptions | `desc = describe_survey_data(df, meta)` |
| `print_survey_summary()` | Human-readable markdown summary | `print_survey_summary(desc, 'summary.md')` |
Phase B Functions
| Function | Purpose | Example |
|---|---|---|
| `select_core_variables()` | Intelligent variable selection | `selection = select_core_variables(desc)` |
| `print_selection_summary()` | Print selection statistics | `print_selection_summary(selection)` |
Phase C Functions
| Function | Purpose | Example |
|---|---|---|
| `recommend_recoding()` | Generate recoding plan | `recoding = recommend_recoding(desc, selection)` |
Phase D Functions
| Function | Purpose | Example |
|---|---|---|
| `export_descriptions()` | Export in multiple formats | `export_descriptions(desc, 'survey_2024')` |
| `create_cleaning_plan_template()` | Create Excel template | `create_cleaning_plan_template(desc)` |
Troubleshooting
Issue: Metadata not loading
Cause: Using pandas instead of pyreadstat
Fix: Use `pyreadstat.read_sav()`, not `pd.read_spss()`
Issue: Encoding errors
Cause: Special characters in variable/value labels
Fix: Specify the encoding: `pyreadstat.read_sav('file.sav', encoding='latin1')`
Issue: Too many/too few variables selected
Cause: Threshold too strict/lenient
Fix: Adjust `missing_threshold_outcomes` (default 0.30)
Issue: Binary recoding doesn't make sense
Cause: Non-ordinal variable or unusual scale
Fix: Review manually and adjust in `recoding_plan.csv`
Resources
- pyreadstat docs: https://github.com/Roche/pyreadstat
- Scripts: Complete implementations in `scripts/`
- Examples: Full workflows in `assets/examples/`
- References: Detailed guides in `references/`
Getting Help
- Check function docstrings: `help(describe_survey_data)`
- Review examples: `assets/examples/GSS_example.py`
- Consult references: `references/troubleshooting.md`
- Verify package versions: `pip list | grep -E "pyreadstat|pandas"`