sorcha.readers.ObjectDataReader
Base class for reading object-related data from a variety of sources and returning a pandas data frame.
Each subclass of ObjectDataReader must implement at least the functions _read_rows_internal and _read_objects_internal, both of which return a pandas data frame. Each data source needs to have a column ObjID that identifies the object and can be used for joining and filtering.
Caching is implemented in the base class. When enabled with
cache_table=True, it lazily loads the full table into memory from the
chosen data source, so it should only be used with smaller data sets.
Both read_rows and read_objects check for a cached table before reading
the files, so they can operate directly on the in-memory pandas data
frame when the data is already loaded.
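The division of labour described above (public read methods in the base class, source-specific `_internal` methods in subclasses, and an optional lazily filled cache) can be sketched as follows. This is a toy model, not Sorcha's implementation: `ToyReaderBase`, `ListReader`, and `source_reads` are hypothetical names, and plain lists of dicts stand in for pandas data frames so the sketch runs without dependencies.

```python
from abc import ABC, abstractmethod


class ToyReaderBase(ABC):
    """Hypothetical stand-in for the base class; not Sorcha's code."""

    def __init__(self, cache_table=False):
        self._cache_table = cache_table
        self._table = None  # filled on first read when caching is on

    @abstractmethod
    def _read_rows_internal(self, block_start=0, block_size=None):
        """Source-specific read of a block of rows."""

    @abstractmethod
    def _read_objects_internal(self, obj_ids):
        """Source-specific read of all rows for the given object IDs."""

    def _load_cache(self):
        # Lazy load: the full table is read once, on first use.
        if self._table is None:
            self._table = self._read_rows_internal(0, None)
        return self._table

    def read_rows(self, block_start=0, block_size=None):
        if self._cache_table:
            stop = None if block_size is None else block_start + block_size
            return self._load_cache()[block_start:stop]
        return self._read_rows_internal(block_start, block_size)

    def read_objects(self, obj_ids):
        if self._cache_table:
            wanted = set(obj_ids)
            return [r for r in self._load_cache() if r["ObjID"] in wanted]
        return self._read_objects_internal(obj_ids)


class ListReader(ToyReaderBase):
    """Toy subclass whose 'data source' is an in-memory list of rows."""

    def __init__(self, rows, **kwargs):
        super().__init__(**kwargs)
        self._rows = rows
        self.source_reads = 0  # counts how often the source is touched

    def _read_rows_internal(self, block_start=0, block_size=None):
        self.source_reads += 1
        stop = None if block_size is None else block_start + block_size
        return self._rows[block_start:stop]

    def _read_objects_internal(self, obj_ids):
        self.source_reads += 1
        wanted = set(obj_ids)
        return [r for r in self._rows if r["ObjID"] in wanted]


rows = [{"ObjID": "a", "H": 1}, {"ObjID": "b", "H": 2}, {"ObjID": "a", "H": 3}]
reader = ListReader(rows, cache_table=True)
second_row = reader.read_rows(block_start=1, block_size=1)
rows_for_a = reader.read_objects(["a"])
```

With caching on, both calls are served from the single in-memory copy, so the toy source is read only once. With caching off, each call goes back to the source through the `_internal` methods.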
Classes
ObjectDataReader -- The base class for reading in the object data.
Module Contents
- class ObjectDataReader(cache_table=False, **kwargs)
Bases: abc.ABC
The base class for reading in the object data.
- abstractmethod get_reader_info()
Return a string identifying the current reader name and input information (for logging and output).
- Returns:
name -- The reader information.
- Return type:
str
- read_rows(block_start=0, block_size=None, **kwargs)
Reads in a set number of rows from the input, performs post-processing and validation, and returns a data frame.
- Parameters:
block_start (int (optional)) -- The 0-indexed row number from which to start reading the data. For example, in a CSV file, block_start=2 would skip the first two lines after the header and return data starting on row=2. Default = 0
block_size (int (optional)) -- The number of rows to read in. Use block_size=None to read in all available data. Default = None
**kwargs (dictionary, optional) -- Extra arguments
- Returns:
res_df -- dataframe of the object data.
- Return type:
Pandas dataframe
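The block_start and block_size semantics can be illustrated with a small, dependency-free sketch. It is not how Sorcha reads files: `read_block` and `CSV_TEXT` are hypothetical names, and the standard-library csv module stands in for a real reader, returning a list of dicts rather than a data frame.

```python
import csv
import io
from itertools import islice


def read_block(csv_text, block_start=0, block_size=None):
    """Return rows [block_start, block_start + block_size) after the header."""
    reader = csv.DictReader(io.StringIO(csv_text))  # consumes the header line
    stop = None if block_size is None else block_start + block_size
    return list(islice(reader, block_start, stop))


CSV_TEXT = "ObjID,H\nobj0,10.1\nobj1,11.2\nobj2,12.3\nobj3,13.4\n"

# Skips obj0 and obj1 (the first two lines after the header), reads one row.
one_row = read_block(CSV_TEXT, block_start=2, block_size=1)

# block_size=None reads everything from block_start to the end.
rest = read_block(CSV_TEXT, block_start=2)
```

Here `one_row` holds only the `obj2` row, and `rest` holds the `obj2` and `obj3` rows.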
- abstractmethod _read_rows_internal(block_start=0, block_size=None, **kwargs)
Function to do the actual source-specific reading.
- read_objects(obj_ids, **kwargs)
Read in a chunk of data corresponding to all rows for a given set of object IDs.
- Parameters:
obj_ids (list) -- A list of object IDs to use.
**kwargs (dictionary, optional) -- Extra arguments
- Returns:
res_df -- The dataframe for the object data.
- Return type:
Pandas dataframe
- abstractmethod _read_objects_internal(obj_ids, **kwargs)
Function to do the actual source-specific reading.
- _validate_object_id_column(input_table)
Checks that the object ID column exists and converts it to a string. This is the common validity check for all object data tables.
- Parameters:
input_table (Pandas dataframe) -- A loaded table.
- Returns:
input_table -- Returns the input dataframe modified in-place.
- Return type:
Pandas dataframe
- _process_and_validate_input_table(input_table, **kwargs)
Perform any input-specific processing and validation on the input table. Modifies the input dataframe in place.
- Parameters:
input_table (Pandas dataframe) -- A loaded table.
**kwargs (dictionary, optional) -- Extra arguments
- Returns:
input_table -- Returns the input dataframe modified in-place.
- Return type:
Pandas dataframe
Notes
The base implementation includes filtering that is common to most input types. Subclasses should call super()._process_and_validate_input_table() to ensure that the ancestor’s validation is also applied.
Additional arguments to use:
- disallow_nan (boolean) -- If True, checks the data for NaNs or nulls.
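The two checks described for the validation methods (the ObjID column must exist and is converted to a string, and NaNs or nulls are optionally rejected) might look roughly like the sketch below. It is an illustration under stated assumptions, not Sorcha's code: `validate_rows` is a hypothetical function, rows are plain dicts rather than a pandas data frame, and the exception types are chosen only for the example.

```python
import math


def validate_rows(rows, disallow_nan=False):
    """Validate a list of row dicts in place and return it."""
    for row in rows:
        # Every row must carry the object identifier.
        if "ObjID" not in row:
            raise KeyError("ObjID column missing from input")
        # Store IDs as strings so joins and filters compare like with like.
        row["ObjID"] = str(row["ObjID"])
        if disallow_nan:
            for name, value in row.items():
                is_nan = isinstance(value, float) and math.isnan(value)
                if value is None or is_nan:
                    raise ValueError(f"NaN or null found in column {name!r}")
    return rows


checked = validate_rows([{"ObjID": 7, "H": 1.5}])

try:
    validate_rows([{"ObjID": "x", "H": float("nan")}], disallow_nan=True)
    nan_rejected = False
except ValueError:
    nan_rejected = True
```

Like the documented methods, the sketch modifies its input in place and returns the same object, so a subclass could chain its own checks after the shared ones.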