# BazerData.jl

Useful functions for working with data: `BazerData.jl` is a placeholder package for some functions that I use frequently in Julia.

So far the package provides five functions:

  1. tabulate some data ([`tabulate`](#tabulate-data))
  2. create categories based on quantiles ([`xtile`](#xtile))
  3. winsorize some data ([`winsorize`](#winsorize-data))
  4. fill unbalanced panel data ([`panel_fill`](#filling-an-unbalanced-panel))
  5. lead and lag functions ([`tlead|tlag`](#leads-and-lags))

Note that as the package grows in different directions, dependencies might become overwhelming.
The readme serves as documentation; there might be more examples inside of the test folder.

## Installation

`BazerData.jl` is a registered package.
You can install it via
```julia
import Pkg
Pkg.add("BazerData")
```

## Usage

### Tabulate data

The `tabulate` function tries to emulate the `tabulate` command from Stata (see oneway [here](https://www.stata.com/manuals/rtabulateoneway.pdf) or twoway [here](https://www.stata.com/manuals13/rtabulatetwoway.pdf)).
It relies on the `DataFrames.jl` package and is useful to get a quick overview of the data.

```julia
using DataFrames
using BazerData
using PalmerPenguins

df = DataFrame(PalmerPenguins.load())

tabulate(df, :island)               # one-way table
tabulate(df, [:island, :species])   # two-way table

# If you are looking for groups by type (e.g. to detect missing values)
df = DataFrame(x = [1, 2, 2, "NA", missing], y = ["c", "c", "b", "z", "d"])
tabulate(df, [:x, :y], group_type = :type) # only types for all group variables
tabulate(df, [:x, :y], group_type = [:value, :type]) # mix value and types
```
I have not implemented all the features of the Stata `tabulate` command, but I am open to suggestions.

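
For readers unfamiliar with Stata, the idea behind a one-way tabulation can be sketched in plain Julia: count each distinct value and report its share of the total. The helper `tabulate_sketch` and the toy vector below are hypothetical and only illustrate the concept; this is not how `tabulate` is implemented, and its output format differs.

```julia
# Hypothetical helper (not part of BazerData): a bare-bones one-way frequency table.
function tabulate_sketch(v::AbstractVector)
    counts = Dict{Any,Int}()
    for x in v
        counts[x] = get(counts, x, 0) + 1   # `missing` is a valid Dict key
    end
    total = length(v)
    rows = sort(collect(counts); by = last, rev = true)  # most frequent first
    return [(value = k, freq = n, pct = 100 * n / total) for (k, n) in rows]
end

# Toy data, made up for illustration
islands = ["Biscoe", "Dream", "Biscoe", "Torgersen", "Biscoe", "Dream"]
tab = tabulate_sketch(islands)
for row in tab
    println(rpad(string(row.value), 12), lpad(row.freq, 4), lpad(round(row.pct; digits = 1), 7))
end
```

The sketch returns a vector of named tuples so that the rows are easy to inspect or turn into a `DataFrame`.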

### xtile

See the [doc](https://eloualiche.github.io/BazerData.jl/dev/man/xtile_guide) or the [tests](https://github.com/eloualiche/BazerData.jl/blob/main/test/UnitTests/xtile.jl) for examples.
```julia
using Random     # for randstring
using StatsBase  # for Weights

sales = rand(10_000);
a = xtile(sales, 10);
b = xtile(sales, 10, weights=Weights(repeat([1], length(sales))) );
# works on strings
cities = [randstring() for _ in 1:10]
xtile(cities, 10)
```
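
As a rough illustration of what quantile-based categories mean, the sketch below computes the interior cutpoints with `Statistics.quantile` and assigns each value to a bin. The helper `xtile_sketch` is hypothetical, handles neither weights nor strings, and labels bins from 1 to `n`; the labeling convention used by `xtile` itself may differ.

```julia
using Statistics

# Hypothetical helper (not part of BazerData): assign each value to one of n quantile bins.
function xtile_sketch(x::AbstractVector{<:Real}, n::Integer)
    cuts = quantile(collect(x), (1:n-1) ./ n)        # n-1 interior cutpoints, ascending
    # index of the first cutpoint >= v, or n when v exceeds every cutpoint
    return [searchsortedfirst(cuts, v) for v in x]
end

bins = xtile_sketch(1:100, 4)   # quartiles of 1, 2, ..., 100
```

With this toy input each of the four bins receives 25 observations.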

### Winsorize data

This is fairly standard and I offer options to specify probabilities or cutpoints; moreover, you can replace the winsorized values with `missing`, the cutpoints, or some specific values.
There is a [`winsor`](https://juliastats.org/StatsBase.jl/stable/robust/#StatsBase.winsor) function in StatsBase.jl but I think it's a little less full-featured.

See the doc for [examples](https://eloualiche.github.io/BazerData.jl/dev/man/winsorize_guide).
```julia
df = DataFrame(PalmerPenguins.load())
winsorize(df.flipper_length_mm, probs=(0.05, 0.95)) # skipmissing by default
transform(df, :flipper_length_mm =>
    (x->winsorize(x, probs=(0.05, 0.95), replace_value=missing)), renamecols=false)
```
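
For intuition, winsorizing by probabilities amounts to computing two quantiles and clamping everything outside them to those cutpoints. The following is a minimal sketch under that assumption; `winsorize_sketch` is a hypothetical helper and does not reproduce the options of `winsorize` (cutpoints, replacement values).

```julia
using Statistics

# Hypothetical helper (not part of BazerData): clamp values to the quantile cutpoints.
function winsorize_sketch(x::AbstractVector; probs = (0.05, 0.95))
    lo, hi = quantile(collect(skipmissing(x)), [probs[1], probs[2]])
    # `missing` entries pass through untouched; everything else is clamped to [lo, hi]
    return [ismissing(v) ? missing : clamp(v, lo, hi) for v in x]
end

x = [collect(1.0:100.0); missing]   # toy data with one missing value
w = winsorize_sketch(x)
```

Values inside the cutpoints are left unchanged, so only the tails of the distribution are affected.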

### Filling an unbalanced panel

Sometimes it is impractical to work with unbalanced panel data.
There are many ways to fill values between dates (which interpolation to use) and I try to implement a few of them.
I use the function sparingly, so it has not been tested extensively.

See the following example (or the test suite) for more information.
```julia
using Dates   # for Date and Month

df_panel = DataFrame(        # unbalanced panel: some ids skip dates
    id = ["a","a", "b","b", "c","c","c", "d","d","d","d"],
    t  = [Date(1990, 1, 1), Date(1990, 4, 1), Date(1990, 8, 1), Date(1990, 9, 1),
          Date(1990, 1, 1), Date(1990, 2, 1), Date(1990, 4, 1),
          Date(1999, 11, 10), Date(1999, 12, 21), Date(2000, 2, 5), Date(2000, 4, 1)],
    v1 = [1,1, 1,6, 6,0,0, 1,4,11,13],
    v2 = [1,2,3,6,6,4,5, 1,2,3,4],
    v3 = [1,5,4,6,6,15,12.25, 21,22.5,17.2,1])

panel_fill(df_panel, :id, :t, [:v1, :v2, :v3],
    gap=Month(1), method=:backwards, uniquecheck=true, flag=true)
panel_fill(df_panel, :id, :t, [:v1, :v2, :v3],
    gap=Month(1), method=:forwards, uniquecheck=true, flag=true)
panel_fill(df_panel, :id, :t, [:v1, :v2, :v3],
    gap=Month(1), method=:linear, uniquecheck=true, flag=true)
```
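
To illustrate what filling a gap means for a single unit, here is a minimal sketch that inserts the missing dates at a fixed step and carries the last observed value into them. `fill_gaps_sketch` is a hypothetical helper; I am not claiming it matches any particular `method` of `panel_fill`, whose exact conventions are best checked in the test suite.

```julia
using Dates

# Hypothetical helper (not part of BazerData): fill the dates between observations
# for one unit, carrying the last observed value into each gap.
function fill_gaps_sketch(t::Vector{Date}, v::Vector; gap = Month(1))
    order = sortperm(t)
    ts, vs = t[order], v[order]
    out_t, out_v = Date[], eltype(v)[]
    for i in eachindex(ts)
        push!(out_t, ts[i]); push!(out_v, vs[i])
        i == lastindex(ts) && break
        d = ts[i] + gap
        while d < ts[i + 1]            # insert every missing step before the next observation
            push!(out_t, d); push!(out_v, vs[i])
            d += gap
        end
    end
    return out_t, out_v
end

# Series of id "a" from the example above (column v3)
t_filled, v_filled = fill_gaps_sketch([Date(1990, 1, 1), Date(1990, 4, 1)], [1.0, 5.0])
```

Here February and March 1990 are added and both take the January value.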

### Leads and lags

This is largely "borrowed" (copied) from @FuZhiyu's [`PanelShift.jl`](https://github.com/FuZhiyu/PanelShift.jl) package.
See the tests for more examples.

```julia
x, t = [1, 2, 3], [1, 2, 4]
tlag(x, t)
tlag(x, t, n=2)

using Dates
t = [Date(2020,1,1); Date(2020,1,2); Date(2020,1,4)];
tlag(x, t)
tlag(x, t, n=Day(2)) # specify two-day lags
```
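
The point of a time-aware lag is that it looks up the value observed exactly `n` periods earlier rather than the previous row, so gaps in `t` produce `missing`. The sketch below shows that idea with a dictionary lookup; `tlag_sketch` is a hypothetical helper, and I am assuming (not asserting) that `tlag` behaves the same way on these inputs.

```julia
using Dates

# Hypothetical helper (not part of BazerData): value observed exactly `n` before each time.
function tlag_sketch(x::AbstractVector, t::AbstractVector; n)
    lookup = Dict(ti => xi for (ti, xi) in zip(t, x))   # time => value
    return [get(lookup, ti - n, missing) for ti in t]   # missing when t - n was not observed
end

x = [1, 2, 3]
r_int  = tlag_sketch(x, [1, 2, 4]; n = 1)
r_date = tlag_sketch(x, [Date(2020, 1, 1), Date(2020, 1, 2), Date(2020, 1, 4)]; n = Day(2))
```

A lead is the same lookup with `ti + n` in place of `ti - n`.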

## Other stuff

See my other packages:
  - [BazerUtils.jl](https://github.com/eloualiche/BazerUtils.jl) which groups together data wrangling functions.
  - [FinanceRoutines.jl](https://github.com/eloualiche/FinanceRoutines.jl) which is more focused and centered on working with financial data.
  - [TigerFetch.jl](https://github.com/eloualiche/TigerFetch.jl) which simplifies downloading shape files from the Census.