commit 5d9906c5c6df32a4581032675430431bb7c130b6
parent 49dd8afd730dec99c8ce647ac17470c162469ec1
Author: Erik Loualiche <[email protected]>
Date: Sun, 22 Mar 2026 13:57:44 -0500
Add date range validation, dynamic Compustat schema, update docs
- _validate_date_range: warns on reversed/ancient/future dates, called
in all WRDS import functions
- _get_funda_columns: queries comp.funda schema at runtime, caches per
session, falls back to hardcoded list on error
- README: document FF5, momentum, portfolio returns, diagnose
- docs/src/index.md: rewrite usage section with all new features
- docs/src/man/yield_curve_gsw.md: update missing flag list
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Diffstat:
6 files changed, 118 insertions(+), 35 deletions(-)
diff --git a/README.md b/README.md
@@ -7,12 +7,14 @@
`FinanceRoutines.jl` is a package that contains useful functions to download and process academic financial data.
-So far the package provides function to import:
+The package provides functions to:
- - CRSP and Compustat from the WRDS Postgres server
- - Fama-French three factors series from Ken French's website
- - GSW Yield curves from the [NY Fed](https://www.federalreserve.gov/pubs/feds/2006/200628/200628abs.html)
- - Estimation of betas for stocks
+ - Import CRSP and Compustat from the WRDS Postgres server
+ - Import Fama-French 3-factor, 5-factor, and momentum series from Ken French's website
+ - Import GSW yield curves from the [Federal Reserve](https://www.federalreserve.gov/pubs/feds/2006/200628/200628abs.html) and compute bond returns
+ - Estimate rolling betas for stocks
+ - Calculate equal-weighted and value-weighted portfolio returns
+ - Run data quality diagnostics on financial DataFrames
## Installation
@@ -73,14 +75,20 @@ df_msf = link_MSF(df_linktable,
df_msf = innerjoin(df_msf, df_funda, on = [:gvkey, :datey], matchmissing=:notequal)
```
-### Import the Fama-French three factors
+### Import Fama-French factors
-This downloads directly data from Ken French's website and formats the data
+Download directly from Ken French's website. Supports 3-factor, 5-factor, and momentum at daily/monthly/annual frequency.
```julia
+# 3-factor model
df_FF3 = import_FF3()
-# there is an option to download the daily factors
df_FF3_daily = import_FF3(frequency=:daily)
+
+# 5-factor model (adds profitability RMW and investment CMA)
+df_FF5 = import_FF5()
+
+# Momentum factor
+df_mom = import_FF_momentum()
```
### Estimate treasury bond returns
@@ -103,15 +111,30 @@ transform!(df_GSW,
See the [doc](https://louloulibs.github.io/FinanceRoutines.jl/) and tests for more options.
-### Common operations in asset pricing
+### Portfolio returns
-Look in the documentation for a guide on how to estimate betas: over the whole sample and using rolling regressions.
-The package exports the function `calculate_rolling_betas`.
+```julia
+# Equal-weighted portfolio returns by date
+df_ew = calculate_portfolio_returns(df_msf, :ret, :datem; weighting=:equal)
+
+# Value-weighted by market cap, grouped by size quintile
+df_vw = calculate_portfolio_returns(df_msf, :ret, :datem;
+ weighting=:value, weight_col=:mktcap, groups=:size_quintile)
+```
+
+### Data diagnostics
+```julia
+report = diagnose(df_msf)
+report[:missing_rates] # fraction missing per column
+report[:duplicate_keys] # duplicate (permno, date) pairs
+report[:suspicious_values] # extreme returns, negative prices
+```
-## To Do
+### Common operations in asset pricing
- - `olsgmm` from cochrane GMM code
+Look in the documentation for a guide on how to estimate betas: over the whole sample and using rolling regressions.
+The package exports `calculate_rolling_betas`.
## Other references to work with financial data
diff --git a/docs/src/index.md b/docs/src/index.md
@@ -4,12 +4,10 @@
## Introduction
-This package provides a collection of routines for academic finance work.
-This is useful to get started with a clean copy of asset prices from CRSP and a ad-hoc merge with the accounting data from the Compustat Funda file.
+This package provides a collection of routines for academic finance work.
+It gives you a clean pipeline from raw WRDS data (CRSP, Compustat) through to standard research datasets, plus tools for Fama-French factors, treasury yield curves, portfolio construction, and data diagnostics.
-I have also added utilities to download treasury yield curves (GSW) and Fama-French research factors.
-
-This is still very much work in progress: file [issues](https://github.com/louloulibs/FinanceRoutines.jl/issues) for comments.
+File [issues](https://github.com/louloulibs/FinanceRoutines.jl/issues) for comments.
## Installation
@@ -31,16 +29,27 @@ Pkg.add("https://github.com/louloulibs/FinanceRoutines.jl")
## Usage
- - Using WRDS (CRSP, Compustat, etc)
- + See the [WRDS User Guide](@ref) for an introduction to using the package to download data from WRDS
- + See [Transitioning to the new CRSP file format](@ref) for a guide on converting from SIZ to CIZ
+ - WRDS (CRSP, Compustat)
+ + [WRDS User Guide](@ref) — download and merge CRSP/Compustat data
+ + [Transitioning to the new CRSP file format](@ref) — SIZ to CIZ migration
+
+ - Fama-French factors
+ + `import_FF3()` — 3-factor model (market, size, value)
+ + `import_FF5()` — 5-factor model (adds profitability, investment)
+ + `import_FF_momentum()` — momentum factor
+ + All support `:daily`, `:monthly`, `:annual` frequencies
- Treasury yield curves
- + See the [Import Yield Curve Data](@ref) guide for GSW yield curve parameters and bond return calculations
+ + [Import Yield Curve Data](@ref) — GSW parameters, yields, prices, bond returns
+
+ - Portfolio analytics
+ + `calculate_portfolio_returns` — equal/value-weighted returns with optional grouping
+ + `calculate_rolling_betas` — rolling window factor regressions
+ + `diagnose` — missing rates, duplicates, suspicious values
- - Demos to how this integrates into standard estimations
- + See how to estimate asset pricing betas in the [Estimating Stock Betas](@ref) demo.
- + Build general queries for the WRDS postgres in [Advanced WRDS](@ref)
+ - Demos
+ + [Estimating Stock Betas](@ref) — unconditional and rolling betas
+ + [Advanced WRDS](@ref) — custom Postgres queries
## Other Resources
diff --git a/docs/src/man/yield_curve_gsw.md b/docs/src/man/yield_curve_gsw.md
@@ -174,7 +174,7 @@ The package automatically detects which model to use based on available paramete
## Missing Data Handling
-- Automatically converts `-999` flag values to `missing`
+- Automatically converts common flag values to `missing`: `-999.99`, `-999`, `-9999`, `-99.99`
- Gracefully handles periods with missing τ₂/β₃ parameters
- Propagates missing values through calculations appropriately
diff --git a/src/ImportCRSP.jl b/src/ImportCRSP.jl
@@ -40,6 +40,7 @@ function import_MSF(wrds_conn::Connection;
variables::Vector{String} = [""]
)
+ date_range = _validate_date_range(date_range)
# -- GETTING COLUMN NAMES
# download potential columns
@@ -293,11 +294,11 @@ Import the CRSP Monthly Stock File (MSF) from CRSP on WRDS PostGres server from
function import_MSF_v2(wrds_conn::Connection;
date_range::Tuple{Date, Date} = (Date("1900-01-01"), Dates.today()),
variables::Vector{String} = [""],
- logging_level::Symbol = :debug, # either none, debug, info etc... tbd
+ logging_level::Symbol = :debug, # either none, debug, info etc... tbd
)
+ date_range = _validate_date_range(date_range)
-
# ----------------------------------------------------------------------------------------------
# the easy way
@debug "Getting monthly stock file (CIZ) ... msf_v2"
@@ -445,9 +446,10 @@ end
# --------------------------------------------------------------------------------------------------
function import_DSF(wrds_conn::Connection;
date_range::Tuple{Date, Date} = (Date("1900-01-01"), Dates.today()),
- logging_level::Symbol = :debug, # either none, debug, info etc... tbd
+ logging_level::Symbol = :debug, # either none, debug, info etc... tbd
)
+ date_range = _validate_date_range(date_range)
# set up the query for msf
postgre_query_dsf = """
@@ -480,9 +482,10 @@ end
# --------------------------------------------------------------------------------------------------
function import_DSF_v2(wrds_conn::Connection;
date_range::Tuple{Date, Date} = (Date("1900-01-01"), Dates.today()),
- logging_level::Symbol = :debug, # either none, debug, info etc... tbd
+ logging_level::Symbol = :debug, # either none, debug, info etc... tbd
)
+ date_range = _validate_date_range(date_range)
# could pick either way ...
# dsf_columns = _get_postgres_columns("crsp", "dsf_v2"; wrds_conn=wrds_conn) |> sort
diff --git a/src/ImportComp.jl b/src/ImportComp.jl
@@ -38,17 +38,19 @@ function import_Funda(wrds_conn::Connection;
filter_variables=Dict(:CURCD => "USD") # if you want something fanciers ... export variable and do it later
)
+ date_range = _validate_date_range(date_range; earliest=Date("1950-01-01"))
+
var_funda = ["GVKEY", "DATADATE", "SICH", "FYR", "FYEAR",
"AT", "LT", "SALE", "EBITDA", "CAPX", "NI", "DV", "CEQ", "CEQL", "SEQ",
"TXDITC", "TXP", "TXDB", "ITCB", "DVT", "PSTK", "PSTKL", "PSTKRV"]
!isnothing(variables) && append!(var_funda, uppercase.(variables))
!isnothing(filter_variables) && append!(var_funda, uppercase.(string.(keys(filter_variables))))
- # TODO WE SHOULD PROBABLY KEEP SOMEWHERE AS DATA THE LIST OF VALID COLUMNS
- # THEN THROW A WARNING IF IT DOESNT FIT
- var_check = setdiff(var_funda, compd_funda)
- size(var_check, 1) > 0 && (@warn "Queried variables not in dataset ... : $(join(var_check, ","))")
- filter!(in(compd_funda), var_funda)
+ # Validate variables against actual schema (falls back to hardcoded list)
+ valid_cols = _get_funda_columns(wrds_conn)
+ var_check = setdiff(var_funda, valid_cols)
+ !isempty(var_check) && (@warn "Queried variables not in comp.funda: $(join(var_check, ", "))")
+ filter!(in(valid_cols), var_funda)
# set up the query for funda (dates parameterized to prevent SQL injection)
postgre_query_funda = """
@@ -159,6 +161,33 @@ end
# ------------------------------------------------------------------------------------------
+# Session cache for funda columns — queried once per session, falls back to hardcoded list
+const _funda_columns_cache = Ref{Vector{String}}()
+
+"""
+ _get_funda_columns(wrds_conn) -> Vector{String}
+
+Get valid column names for comp.funda. Queries the schema at runtime and caches
+for the session. Falls back to the hardcoded `compd_funda` list on error.
+"""
+function _get_funda_columns(wrds_conn)
+ if isassigned(_funda_columns_cache)
+ return _funda_columns_cache[]
+ end
+ try
+ cols = _get_postgres_columns("comp", "funda"; wrds_conn=wrds_conn)
+ _funda_columns_cache[] = cols
+ return cols
+ catch e
+ @warn "Could not query comp.funda schema, using hardcoded variable list" exception=(e, catch_backtrace())
+ return compd_funda
+ end
+end
+# ------------------------------------------------------------------------------------------
+
+
+# ------------------------------------------------------------------------------------------
+# Hardcoded fallback list of comp.funda columns (last verified: 2025-02)
const compd_funda = [
"GVKEY", "DATADATE", "FYEAR", "INDFMT", "CONSOL", "POPSRC", "DATAFMT", "TIC", "CUSIP", "CONM",
"ACCTCHG", "ACCTSTD", "ACQMETH", "ADRR", "AJEX", "AJP", "BSPR", "COMPST", "CURCD", "CURNCD",
diff --git a/src/Utilities.jl b/src/Utilities.jl
@@ -78,4 +78,23 @@ end
# --------------------------------------------------------------------------------------------------
+# --------------------------------------------------------------------------------------------------
+"""
+ _validate_date_range(date_range; earliest, latest) -> Tuple{Date, Date}
+
+Validate and normalize a date range. Swaps reversed dates, warns on suspicious ranges.
+"""
+function _validate_date_range(date_range::Tuple{Date, Date};
+ earliest::Date=Date("1925-01-01"),
+ latest::Date=Dates.today() + Month(6))
+ start_date, end_date = date_range
+ if start_date > end_date
+ @warn "Start date after end date, swapping" start_date end_date
+ start_date, end_date = end_date, start_date
+ end
+ start_date < earliest && @warn "Start date $start_date is before earliest available data ($earliest)"
+ end_date > latest && @warn "End date $end_date is in the future"
+ return (start_date, end_date)
+end
+# --------------------------------------------------------------------------------------------------