Back all `AllFires` and `Fire` objects with 2 dataframes

Merged: Julia Signell requested to merge `preprocess-db` into `primarykeyv2`

This branch uses two dataframes: an allpixels dataframe with one row per pixel, and an allfires geodataframe with one row per fire per t. The core concept is that if you back the allfires and fire objects with dataframes, there are well-defined ways to serialize them to disk whenever you like (aka no more pickles!).
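
Roughly, the two tables look something like this (a minimal sketch; the column names are illustrative, not the actual schema in the branch):

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon

# allpixels: one row per fire pixel, with a unique id per row
# (illustrative columns, not the exact schema)
allpixels = pd.DataFrame(
    {
        "lon": [-120.10, -120.12, -120.20],
        "lat": [38.50, 38.51, 38.60],
        "t": pd.to_datetime(["2021-08-01", "2021-08-01", "2021-08-02"]),
        "fid": [1, 1, 1],
    },
    index=pd.RangeIndex(3, name="uuid"),
)

# allfires_gdf: one row per fire per t, indexed by a MultiIndex of (fid, t)
idx = pd.MultiIndex.from_tuples(
    [(1, pd.Timestamp("2021-08-01")), (1, pd.Timestamp("2021-08-02"))],
    names=["fid", "t"],
)
allfires_gdf = gpd.GeoDataFrame(
    {
        "ftype": ["forest", "forest"],
        "mergeid": [1, 1],
        "hull": [
            Polygon([(-120.12, 38.50), (-120.10, 38.50), (-120.10, 38.51)]),
            Polygon([(-120.20, 38.50), (-120.10, 38.50), (-120.10, 38.60)]),
        ],
    },
    geometry="hull",
    index=idx,
)

# Both are plain (geo)dataframes, so persisting them is just standard I/O
# rather than pickling the whole AllFires object graph.
allpixels.to_csv("allpixels.csv")
allfires_gdf.to_parquet("allfires.parq")
```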

Here's a bit of an overview of the lifecycle of each of these dataframes:

allpixels:

  • At the start of Fire_Forward all of the preprocessed pixel data is loaded and concatenated into one long dataframe.
  • Each row represents a fire pixel and there is a unique id per row.
  • As Fire_Forward iterates through the timesteps of interest, the allpixels dataframe is updated in place.
  • Each Fire object treats allpixels as the source of truth: it does not hold its own pixel data, but instead refers to subsets of the allpixels dataframe to return things like n_pixels or newpixels.
  • Merging fires at a particular t can update allpixels rows from an earlier timestep.
  • When Fire_Forward is complete, the allpixels object can be serialized to csv (or any tabular format), optionally partitioned into files by t (see the sketch after this list).
  • This dataframe can be used:
    • together with allfires_gdf to rehydrate the allfires object at the latest t in order to run Fire_Forward on one new ingest file.
    • independently to write the nplist output file for largefires.
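
A minimal sketch of that lifecycle, with placeholder assignment logic and file naming standing in for the real Fire_Forward internals:

```python
import pandas as pd

def fire_forward_pixels(preprocessed: list[pd.DataFrame]) -> pd.DataFrame:
    """Sketch: allpixels as the single mutable table behind Fire_Forward."""
    # Load and concatenate all of the preprocessed pixel data into one long
    # dataframe; each row is a fire pixel with a unique id.
    allpixels = pd.concat(preprocessed, ignore_index=True)
    allpixels.index.name = "uuid"
    allpixels["fid"] = float("nan")  # pixels start unassigned to any fire

    for t in sorted(allpixels["t"].unique()):
        in_t = allpixels["t"] == t
        # All updates happen in place on the shared table. As a stand-in for
        # the real clustering, give every unassigned pixel at this t a fid.
        allpixels.loc[in_t & allpixels["fid"].isna(), "fid"] = 1

        # A merge at this t can also rewrite fids that were assigned at an
        # earlier t, e.g. folding fire 2 into fire 1:
        allpixels.loc[allpixels["fid"] == 2, "fid"] = 1

    return allpixels

def write_allpixels(allpixels: pd.DataFrame, outdir: str) -> None:
    # Serialize to csv (or any tabular format), here partitioned by t.
    for t, chunk in allpixels.groupby("t"):
        chunk.to_csv(f"{outdir}/allpixels_{t:%Y%m%d_%H%M}.csv")
```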

allfires_gdf:

  • At the start of Fire_Forward, a new geodataframe object is initialized. It has a column for each of the Fire attributes that take a non-trivial amount of time to compute (ftype, hull, fline...).
  • As Fire_Forward iterates through the timesteps of interest, it writes a row for every fire that is burning (aka has new pixels) at that t.
  • So each row contains the information about one fire at one t; the index is a MultiIndex of (fid, t).
  • Merging fires at a particular t updates the mergeid on the existing rows (I am not totally confident this part is correct).
  • When Fire_Forward is complete, the allfires_gdf object can be serialized to geoparquet (the best choice, since the dataframe contains multiple geometry columns), optionally partitioned into files by t (sketched after this list).
  • This geodataframe can be used:
    • together with allpixels to rehydrate the allfires object at the latest t in order to run Fire_Forward on one new ingest file.
    • independently to write all the snapshot and largefires output files.
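
A hedged sketch of that flow; the helper names (record_fire, merge_fires) and the exact columns are assumptions, not the code in this branch:

```python
import geopandas as gpd
import pandas as pd

def record_fire(allfires_gdf, fid, t, ftype, hull, fline):
    """Append one row for a fire that is burning at this t, indexed by (fid, t)."""
    idx = pd.MultiIndex.from_tuples([(fid, t)], names=["fid", "t"])
    row = gpd.GeoDataFrame({"ftype": [ftype], "mergeid": [fid]}, index=idx)
    row["hull"] = gpd.GeoSeries([hull], index=idx)
    row["fline"] = gpd.GeoSeries([fline], index=idx)
    return pd.concat([allfires_gdf, row.set_geometry("hull")])

def merge_fires(allfires_gdf, source_fid, target_fid):
    """Merging updates mergeid on the existing rows of the absorbed fire."""
    allfires_gdf.loc[allfires_gdf["mergeid"] == source_fid, "mergeid"] = target_fid

# Geoparquet preserves multiple geometry columns (hull, fline, ...), which is
# why it is the natural on-disk format here:
#   allfires_gdf.to_parquet("allfires.parq")
#
# Rehydration takes the rows at the latest t to rebuild the allfires object
# before running Fire_Forward on one new ingest file:
#   gdf = gpd.read_parquet("allfires.parq")
#   latest = gdf.xs(gdf.index.get_level_values("t").max(), level="t")
```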

Side note: I like that in this branch the allpixels dataframe is referenced by all the Fire objects but never copied around. This is different from how it works in preprocess, where each Fire object (at each t) has its own dataframe. It is also different from the original version of this algorithm, where each Fire object (at each t) holds a bunch of lists.
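
To illustrate the reference-not-copy pattern, here is a toy version of what a Fire looks like under this scheme (not the actual class in the branch):

```python
import pandas as pd

class Fire:
    """Toy Fire: holds a reference to the shared allpixels table plus its fid,
    and derives its pixel views on demand instead of copying data around."""

    def __init__(self, fid: int, t: pd.Timestamp, allpixels: pd.DataFrame):
        self.fid = fid
        self.t = t
        self.allpixels = allpixels  # shared reference, never a copy

    @property
    def pixels(self) -> pd.DataFrame:
        return self.allpixels[self.allpixels["fid"] == self.fid]

    @property
    def newpixels(self) -> pd.DataFrame:
        return self.pixels[self.pixels["t"] == self.t]

    @property
    def n_pixels(self) -> int:
        return len(self.pixels)

# Because every Fire points at the same allpixels object, an in-place update
# (e.g. reassigning fids during a merge) is immediately visible to all fires.
```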
