BazerUtils.jl

Assorted Julia utilities including custom logging

# Reading HTML Tables

Parse HTML tables into DataFrames: a Julia-native replacement for pandas' `read_html`.

---

## Quick Start

```julia
using BazerUtils

# From a URL
dfs = read_html_tables("https://en.wikipedia.org/wiki/List_of_Alabama_state_parks")

# From a raw HTML string
dfs = read_html_tables("<table><tr><th>A</th></tr><tr><td>1</td></tr></table>")
```

`read_html_tables` returns a `Vector{DataFrame}`, one per `<table>` element found.

---

## API

```@docs
read_html_tables
```

---

## Keyword Arguments

### `match`

Pass a `Regex` to keep only tables whose text content matches:

```julia
dfs = read_html_tables(url; match=r"Population"i)
```
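
As a rough illustration of the filtering idea, the sketch below applies a case-insensitive regex to the plain-text content of several tables. `keep_matching` and the sample strings are hypothetical and not part of BazerUtils; the package's actual matching logic may differ.

```julia
# Hypothetical helper, not part of BazerUtils: keep only the table texts
# that contain a match for `rx` (passing `nothing` keeps everything).
keep_matching(texts::Vector{String}, rx::Union{Regex,Nothing}) =
    rx === nothing ? texts : filter(t -> occursin(rx, t), texts)

# Made-up text content of three tables on a page
texts = ["City Population 2020", "Park Name County", "Total population by year"]

kept = keep_matching(texts, r"Population"i)
println(length(kept))  # 2: the first and third entries match, since `i` ignores case
```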

### `flatten`

Controls how multi-level headers (multiple `<thead>` rows) become column names.
DataFrames column names must be single strings (or symbols), so multi-level header tuples are flattened:

| Value | Column name example | Description |
|:------|:--------------------|:------------|
| `nothing` (default) | `"(Region, Name)"` | Tuple string representation |
| `:join` | `"Region_Name"` | Levels joined with `_` |
| `:last` | `"Name"` | Last header level only |

```julia
dfs = read_html_tables(html; flatten=:join)
```
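
The three modes in the table can be pictured with a small standalone function. This is a minimal sketch that assumes each column's header levels arrive as a tuple of strings; `flatten_name` is a hypothetical name, not the package's internal function.

```julia
# Hypothetical sketch of the three `flatten` modes; not the package's code.
# `levels` holds one column's header text at each level, top to bottom.
function flatten_name(levels::Tuple{Vararg{String}}, mode::Union{Symbol,Nothing}=nothing)
    if mode === nothing
        return "(" * join(levels, ", ") * ")"   # tuple-like string
    elseif mode === :join
        return join(levels, "_")                # levels joined with `_`
    elseif mode === :last
        return last(levels)                     # deepest level only
    else
        throw(ArgumentError("unsupported flatten mode: $mode"))
    end
end

levels = ("Region", "Name")
println(flatten_name(levels))          # (Region, Name)
println(flatten_name(levels, :join))   # Region_Name
println(flatten_name(levels, :last))   # Name
```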

---

## How It Works

1. **Fetch**: URLs (starting with `http`) are downloaded via `HTTP.jl`; raw strings are parsed directly.
2. **Parse**: HTML is parsed with `Gumbo.jl`; `<table>` elements are selected with `Cascadia.jl`.
3. **Classify rows**: `<thead>` rows become headers, `<tbody>`/`<tfoot>` rows become body data. Without an explicit `<thead>`, consecutive all-`<th>` rows at the top are promoted to headers.
4. **Expand spans**: `colspan` and `rowspan` attributes are expanded into a dense grid (same algorithm as pandas' `_expand_colspan_rowspan`).
5. **Build DataFrame**: Empty cells become `missing`. Duplicate column names get `.1`, `.2` suffixes.
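
Steps 3 and 5 can be sketched with the standard library alone. The helper names below are hypothetical, rows are modelled as vectors of tag names, and treating whitespace-only text as empty is an assumption; none of this is the package's actual code.

```julia
# Hypothetical sketches of steps 3 and 5; names and details are assumptions.

# Step 3 (no <thead> case): count the leading rows made up entirely of <th> cells.
# Each row is a vector of tag names, e.g. ["th", "th"] or ["td", "td"].
function count_header_rows(rows::Vector{Vector{String}})
    n = 0
    for row in rows
        (!isempty(row) && all(==("th"), row)) || break
        n += 1
    end
    return n
end

# Step 5: make column names unique by suffixing repeats with .1, .2, ...
function dedup_names(cols::Vector{String})
    seen = Dict{String,Int}()   # name => number of times seen so far
    out = String[]
    for name in cols
        k = get(seen, name, 0)
        push!(out, k == 0 ? name : string(name, ".", k))
        seen[name] = k + 1
    end
    return out
end

# Step 5: blank cell text becomes `missing` (whitespace-only counts as blank here)
cell_value(text::AbstractString) = isempty(strip(text)) ? missing : String(text)

rows = [["th", "th"], ["th", "th"], ["td", "td"]]
println(count_header_rows(rows))                           # 2
println(dedup_names(["Name", "Name", "Pop", "Name"])[4])   # Name.2
println(ismissing(cell_value("  ")))                       # true
```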

---

## Examples

### Filter tables by content

```julia
# Only tables mentioning "GDP"
dfs = read_html_tables(url; match=r"GDP"i)
```

### Multi-level headers

```julia
html = """
<table>
  <thead>
    <tr><th colspan="2">Region</th></tr>
    <tr><th>Name</th><th>Pop</th></tr>
  </thead>
  <tbody>
    <tr><td>East</td><td>100</td></tr>
  </tbody>
</table>
"""

read_html_tables(html; flatten=:join)
# 1×2 DataFrame: columns "Region_Name", "Region_Pop"
```

### Tables with colspan/rowspan

Spanned cells are duplicated into every position they cover, so the resulting DataFrame has a regular rectangular shape with no gaps.
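
To make the duplication concrete, here is a minimal standalone sketch of span expansion. The `Cell` type and `expand_spans` function are hypothetical, the sketch assumes well-formed, non-overlapping spans, and it illustrates the idea rather than reproducing the package's (or pandas') implementation.

```julia
# Hypothetical cell type: text plus its colspan/rowspan attributes (default 1).
struct Cell
    text::String
    colspan::Int
    rowspan::Int
end
Cell(text) = Cell(text, 1, 1)

# Expand spans into a dense grid by copying each cell's text into every
# position it covers. Assumes spans do not overlap.
function expand_spans(rows::Vector{Vector{Cell}})
    grid = Vector{Vector{String}}()
    carry = Dict{Int,Tuple{String,Int}}()   # column => (text, rows still to fill)
    for row in rows
        out = String[]
        next_carry = Dict{Int,Tuple{String,Int}}()
        queue = copy(row)
        col = 1
        while !isempty(queue) || haskey(carry, col)
            if haskey(carry, col)
                # Position already claimed by a rowspan from an earlier row
                text, left = carry[col]
                push!(out, text)
                left > 1 && (next_carry[col] = (text, left - 1))
                col += 1
            else
                cell = popfirst!(queue)
                for _ in 1:cell.colspan
                    push!(out, cell.text)
                    cell.rowspan > 1 && (next_carry[col] = (cell.text, cell.rowspan - 1))
                    col += 1
                end
            end
        end
        push!(grid, out)
        carry = next_carry
    end
    return grid
end

rows = [
    [Cell("Region", 1, 2), Cell("Stats", 2, 1)],   # "Region" spans two rows, "Stats" two columns
    [Cell("Pop"), Cell("Area")],
]
grid = expand_spans(rows)
println(join(grid[1], " | "))   # Region | Stats | Stats
println(join(grid[2], " | "))   # Region | Pop | Area
```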

---

## See Also

- [`Gumbo.jl`](https://github.com/JuliaWeb/Gumbo.jl): HTML parser
- [`Cascadia.jl`](https://github.com/Algocircle/Cascadia.jl): CSS selector engine
- [`HTTP.jl`](https://github.com/JuliaWeb/HTTP.jl): HTTP client
- [`DataFrames.jl`](https://github.com/JuliaData/DataFrames.jl): Tabular data