FinanceRoutines.jl

Financial data routines for Julia
Log | Files | Refs | README | LICENSE

commit 5d9906c5c6df32a4581032675430431bb7c130b6
parent 49dd8afd730dec99c8ce647ac17470c162469ec1
Author: Erik Loualiche <[email protected]>
Date:   Sun, 22 Mar 2026 13:57:44 -0500

Add date range validation, dynamic Compustat schema, update docs

- _validate_date_range: warns on reversed/ancient/future dates, called
  in all WRDS import functions
- _get_funda_columns: queries comp.funda schema at runtime, caches per
  session, falls back to hardcoded list on error
- README: document FF5, momentum, portfolio returns, diagnose
- docs/src/index.md: rewrite usage section with all new features
- docs/src/man/yield_curve_gsw.md: update missing flag list

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Diffstat:
MREADME.md | 49++++++++++++++++++++++++++++++++++++-------------
Mdocs/src/index.md | 33+++++++++++++++++++++------------
Mdocs/src/man/yield_curve_gsw.md | 2+-
Msrc/ImportCRSP.jl | 11+++++++----
Msrc/ImportComp.jl | 39++++++++++++++++++++++++++++++++++-----
Msrc/Utilities.jl | 19+++++++++++++++++++
6 files changed, 118 insertions(+), 35 deletions(-)

diff --git a/README.md b/README.md @@ -7,12 +7,14 @@ `FinanceRoutines.jl` is a package that contains useful functions to download and process academic financial data. -So far the package provides function to import: +The package provides functions to: - - CRSP and Compustat from the WRDS Postgres server - - Fama-French three factors series from Ken French's website - - GSW Yield curves from the [NY Fed](https://www.federalreserve.gov/pubs/feds/2006/200628/200628abs.html) - - Estimation of betas for stocks + - Import CRSP and Compustat from the WRDS Postgres server + - Import Fama-French 3-factor, 5-factor, and momentum series from Ken French's website + - Import GSW yield curves from the [Federal Reserve](https://www.federalreserve.gov/pubs/feds/2006/200628/200628abs.html) and compute bond returns + - Estimate rolling betas for stocks + - Calculate equal-weighted and value-weighted portfolio returns + - Run data quality diagnostics on financial DataFrames ## Installation @@ -73,14 +75,20 @@ df_msf = link_MSF(df_linktable, df_msf = innerjoin(df_msf, df_funda, on = [:gvkey, :datey], matchmissing=:notequal) ``` -### Import the Fama-French three factors +### Import Fama-French factors -This downloads directly data from Ken French's website and formats the data +Download directly from Ken French's website. Supports 3-factor, 5-factor, and momentum at daily/monthly/annual frequency. ```julia +# 3-factor model df_FF3 = import_FF3() -# there is an option to download the daily factors df_FF3_daily = import_FF3(frequency=:daily) + +# 5-factor model (adds profitability RMW and investment CMA) +df_FF5 = import_FF5() + +# Momentum factor +df_mom = import_FF_momentum() ``` ### Estimate treasury bond returns @@ -103,15 +111,30 @@ transform!(df_GSW, See the [doc](https://louloulibs.github.io/FinanceRoutines.jl/) and tests for more options. -### Common operations in asset pricing +### Portfolio returns -Look in the documentation for a guide on how to estimate betas: over the whole sample and using rolling regressions. -The package exports the function `calculate_rolling_betas`. +```julia +# Equal-weighted portfolio returns by date +df_ew = calculate_portfolio_returns(df_msf, :ret, :datem; weighting=:equal) + +# Value-weighted by market cap, grouped by size quintile +df_vw = calculate_portfolio_returns(df_msf, :ret, :datem; + weighting=:value, weight_col=:mktcap, groups=:size_quintile) +``` + +### Data diagnostics +```julia +report = diagnose(df_msf) +report[:missing_rates] # fraction missing per column +report[:duplicate_keys] # duplicate (permno, date) pairs +report[:suspicious_values] # extreme returns, negative prices +``` -## To Do +### Common operations in asset pricing - - `olsgmm` from cochrane GMM code +Look in the documentation for a guide on how to estimate betas: over the whole sample and using rolling regressions. +The package exports `calculate_rolling_betas`. ## Other references to work with financial data diff --git a/docs/src/index.md b/docs/src/index.md @@ -4,12 +4,10 @@ ## Introduction -This package provides a collection of routines for academic finance work. -This is useful to get started with a clean copy of asset prices from CRSP and a ad-hoc merge with the accounting data from the Compustat Funda file. +This package provides a collection of routines for academic finance work. +It gives you a clean pipeline from raw WRDS data (CRSP, Compustat) through to standard research datasets, plus tools for Fama-French factors, treasury yield curves, portfolio construction, and data diagnostics. -I have also added utilities to download treasury yield curves (GSW) and Fama-French research factors. - -This is still very much work in progress: file [issues](https://github.com/louloulibs/FinanceRoutines.jl/issues) for comments. +File [issues](https://github.com/louloulibs/FinanceRoutines.jl/issues) for comments. ## Installation @@ -31,16 +29,27 @@ Pkg.add("https://github.com/louloulibs/FinanceRoutines.jl") ## Usage - - Using WRDS (CRSP, Compustat, etc) - + See the [WRDS User Guide](@ref) for an introduction to using the package to download data from WRDS - + See [Transitioning to the new CRSP file format](@ref) for a guide on converting from SIZ to CIZ + - WRDS (CRSP, Compustat) + + [WRDS User Guide](@ref) — download and merge CRSP/Compustat data + + [Transitioning to the new CRSP file format](@ref) — SIZ to CIZ migration + + - Fama-French factors + + `import_FF3()` — 3-factor model (market, size, value) + + `import_FF5()` — 5-factor model (adds profitability, investment) + + `import_FF_momentum()` — momentum factor + + All support `:daily`, `:monthly`, `:annual` frequencies - Treasury yield curves - + See the [Import Yield Curve Data](@ref) guide for GSW yield curve parameters and bond return calculations + + [Import Yield Curve Data](@ref) — GSW parameters, yields, prices, bond returns + + - Portfolio analytics + + `calculate_portfolio_returns` — equal/value-weighted returns with optional grouping + + `calculate_rolling_betas` — rolling window factor regressions + + `diagnose` — missing rates, duplicates, suspicious values - - Demos to how this integrates into standard estimations - + See how to estimate asset pricing betas in the [Estimating Stock Betas](@ref) demo. - + Build general queries for the WRDS postgres in [Advanced WRDS](@ref) + - Demos + + [Estimating Stock Betas](@ref) — unconditional and rolling betas + + [Advanced WRDS](@ref) — custom Postgres queries ## Other Resources diff --git a/docs/src/man/yield_curve_gsw.md b/docs/src/man/yield_curve_gsw.md @@ -174,7 +174,7 @@ The package automatically detects which model to use based on available paramete ## Missing Data Handling -- Automatically converts `-999` flag values to `missing` +- Automatically converts common flag values to `missing`: `-999.99`, `-999`, `-9999`, `-99.99` - Gracefully handles periods with missing τ₂/β₃ parameters - Propagates missing values through calculations appropriately diff --git a/src/ImportCRSP.jl b/src/ImportCRSP.jl @@ -40,6 +40,7 @@ function import_MSF(wrds_conn::Connection; variables::Vector{String} = [""] ) + date_range = _validate_date_range(date_range) # -- GETTING COLUMN NAMES # download potential columns @@ -293,11 +294,11 @@ Import the CRSP Monthly Stock File (MSF) from CRSP on WRDS PostGres server from function import_MSF_v2(wrds_conn::Connection; date_range::Tuple{Date, Date} = (Date("1900-01-01"), Dates.today()), variables::Vector{String} = [""], - logging_level::Symbol = :debug, # either none, debug, info etc... tbd + logging_level::Symbol = :debug, # either none, debug, info etc... tbd ) + date_range = _validate_date_range(date_range) - # ---------------------------------------------------------------------------------------------- # the easy way @debug "Getting monthly stock file (CIZ) ... msf_v2" @@ -445,9 +446,10 @@ end # -------------------------------------------------------------------------------------------------- function import_DSF(wrds_conn::Connection; date_range::Tuple{Date, Date} = (Date("1900-01-01"), Dates.today()), - logging_level::Symbol = :debug, # either none, debug, info etc... tbd + logging_level::Symbol = :debug, # either none, debug, info etc... tbd ) + date_range = _validate_date_range(date_range) # set up the query for msf postgre_query_dsf = """ @@ -480,9 +482,10 @@ end # -------------------------------------------------------------------------------------------------- function import_DSF_v2(wrds_conn::Connection; date_range::Tuple{Date, Date} = (Date("1900-01-01"), Dates.today()), - logging_level::Symbol = :debug, # either none, debug, info etc... tbd + logging_level::Symbol = :debug, # either none, debug, info etc... tbd ) + date_range = _validate_date_range(date_range) # could pick either way ... # dsf_columns = _get_postgres_columns("crsp", "dsf_v2"; wrds_conn=wrds_conn) |> sort diff --git a/src/ImportComp.jl b/src/ImportComp.jl @@ -38,17 +38,19 @@ function import_Funda(wrds_conn::Connection; filter_variables=Dict(:CURCD => "USD") # if you want something fanciers ... export variable and do it later ) + date_range = _validate_date_range(date_range; earliest=Date("1950-01-01")) + var_funda = ["GVKEY", "DATADATE", "SICH", "FYR", "FYEAR", "AT", "LT", "SALE", "EBITDA", "CAPX", "NI", "DV", "CEQ", "CEQL", "SEQ", "TXDITC", "TXP", "TXDB", "ITCB", "DVT", "PSTK", "PSTKL", "PSTKRV"] !isnothing(variables) && append!(var_funda, uppercase.(variables)) !isnothing(filter_variables) && append!(var_funda, uppercase.(string.(keys(filter_variables)))) - # TODO WE SHOULD PROBABLY KEEP SOMEWHERE AS DATA THE LIST OF VALID COLUMNS - # THEN THROW A WARNING IF IT DOESNT FIT - var_check = setdiff(var_funda, compd_funda) - size(var_check, 1) > 0 && (@warn "Queried variables not in dataset ... : $(join(var_check, ","))") - filter!(in(compd_funda), var_funda) + # Validate variables against actual schema (falls back to hardcoded list) + valid_cols = _get_funda_columns(wrds_conn) + var_check = setdiff(var_funda, valid_cols) + !isempty(var_check) && (@warn "Queried variables not in comp.funda: $(join(var_check, ", "))") + filter!(in(valid_cols), var_funda) # set up the query for funda (dates parameterized to prevent SQL injection) postgre_query_funda = """ @@ -159,6 +161,33 @@ end # ------------------------------------------------------------------------------------------ +# Session cache for funda columns — queried once per session, falls back to hardcoded list +const _funda_columns_cache = Ref{Vector{String}}() + +""" + _get_funda_columns(wrds_conn) -> Vector{String} + +Get valid column names for comp.funda. Queries the schema at runtime and caches +for the session. Falls back to the hardcoded `compd_funda` list on error. +""" +function _get_funda_columns(wrds_conn) + if isassigned(_funda_columns_cache) + return _funda_columns_cache[] + end + try + cols = _get_postgres_columns("comp", "funda"; wrds_conn=wrds_conn) + _funda_columns_cache[] = cols + return cols + catch e + @warn "Could not query comp.funda schema, using hardcoded variable list" exception=(e, catch_backtrace()) + return compd_funda + end +end +# ------------------------------------------------------------------------------------------ + + +# ------------------------------------------------------------------------------------------ +# Hardcoded fallback list of comp.funda columns (last verified: 2025-02) const compd_funda = [ "GVKEY", "DATADATE", "FYEAR", "INDFMT", "CONSOL", "POPSRC", "DATAFMT", "TIC", "CUSIP", "CONM", "ACCTCHG", "ACCTSTD", "ACQMETH", "ADRR", "AJEX", "AJP", "BSPR", "COMPST", "CURCD", "CURNCD", diff --git a/src/Utilities.jl b/src/Utilities.jl @@ -78,4 +78,23 @@ end # -------------------------------------------------------------------------------------------------- +# -------------------------------------------------------------------------------------------------- +""" + _validate_date_range(date_range; earliest, latest) -> Tuple{Date, Date} + +Validate and normalize a date range. Swaps reversed dates, warns on suspicious ranges. +""" +function _validate_date_range(date_range::Tuple{Date, Date}; + earliest::Date=Date("1925-01-01"), + latest::Date=Dates.today() + Month(6)) + start_date, end_date = date_range + if start_date > end_date + @warn "Start date after end date, swapping" start_date end_date + start_date, end_date = end_date, start_date + end + start_date < earliest && @warn "Start date $start_date is before earliest available data ($earliest)" + end_date > latest && @warn "End date $end_date is in the future" + return (start_date, end_date) +end +# --------------------------------------------------------------------------------------------------