sorcha.readers.ObjectDataReader

Base class for reading object-related data from a variety of sources and returning a pandas data frame.

Each subclass of ObjectDataReader must implement at least the functions _read_rows_internal and _read_objects_internal, both of which return a pandas data frame. Each data source needs to have a column ObjID that identifies the object and can be used for joining and filtering.
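As a rough illustration of this subclass contract, the sketch below defines a reader backed by a data frame that is already in memory. To stay self-contained it uses a simplified stand-in for the base class (`SketchReaderBase`) rather than importing sorcha. The stand-in, the `InMemoryReader` name, and the sample data are all hypothetical, and the real base class does more (caching and validation).

```python
from abc import ABC, abstractmethod

import pandas as pd


class SketchReaderBase(ABC):
    """Simplified stand-in for ObjectDataReader (an assumption, not sorcha's code)."""

    @abstractmethod
    def get_reader_info(self): ...

    @abstractmethod
    def _read_rows_internal(self, block_start=0, block_size=None, **kwargs): ...

    @abstractmethod
    def _read_objects_internal(self, obj_ids, **kwargs): ...


class InMemoryReader(SketchReaderBase):
    """Hypothetical reader backed by a data frame that is already in memory."""

    def __init__(self, table):
        self._source = table

    def get_reader_info(self):
        return "InMemoryReader"

    def _read_rows_internal(self, block_start=0, block_size=None, **kwargs):
        # block_size=None means "read everything from block_start onward".
        end = None if block_size is None else block_start + block_size
        return self._source.iloc[block_start:end].reset_index(drop=True)

    def _read_objects_internal(self, obj_ids, **kwargs):
        # Keep every row whose ObjID is in the requested set.
        mask = self._source["ObjID"].isin(obj_ids)
        return self._source[mask].reset_index(drop=True)


table = pd.DataFrame({"ObjID": ["a", "b", "b", "c"], "value": [1, 2, 3, 4]})
reader = InMemoryReader(table)
rows = reader._read_rows_internal(block_start=1, block_size=2)
objs = reader._read_objects_internal(["a", "c"])
```

Both methods return a data frame that keeps the ObjID column, which is what allows later joining and filtering.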

Caching is implemented in the base class. When it is enabled (cache_table=True), the full table is lazily loaded into memory from the chosen data source, so it should only be used with smaller data sets. Both read_rows and read_objects check for a cached table before reading the files, which lets them perform direct pandas operations if the data is already in memory.
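The lazy-loading idea can be sketched roughly as follows. This is not sorcha's implementation: `CachingSketch`, its `loader` argument, and the `load_count` counter are hypothetical names, used only to show that a cached table is loaded once and then sliced in memory.

```python
import pandas as pd


class CachingSketch:
    """Hypothetical illustration of lazy caching; not sorcha's implementation."""

    def __init__(self, loader, cache_table=False):
        self._loader = loader            # callable returning the full table
        self._cache_table = cache_table
        self._table = None
        self.load_count = 0              # only here to show when the source is hit

    def _load(self):
        self.load_count += 1
        return self._loader()

    def read_rows(self, block_start=0, block_size=None):
        if self._cache_table:
            if self._table is None:      # first call: load the whole table once
                self._table = self._load()
            full = self._table
        else:
            # Simplification: a real reader would read only the requested block.
            full = self._load()
        end = None if block_size is None else block_start + block_size
        return full.iloc[block_start:end].reset_index(drop=True)


def load_full_table():
    return pd.DataFrame({"ObjID": ["a", "b", "c"], "H": [17.0, 18.5, 20.1]})


cached = CachingSketch(load_full_table, cache_table=True)
first = cached.read_rows(0, 2)
second = cached.read_rows(2, 1)
```

After both calls the source has been loaded only once. The second read is a plain pandas slice of the cached table.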

Classes

ObjectDataReader

The base class for reading in the object data.

Module Contents

class ObjectDataReader(cache_table=False, **kwargs)[source]

Bases: abc.ABC

The base class for reading in the object data.

_cache_table = False[source]

_table = None[source]

abstractmethod get_reader_info()[source]

Return a string identifying the current reader name and input information (for logging and output).

Returns: name -- The reader information.
Return type: str

read_rows(block_start=0, block_size=None, **kwargs)[source]

Reads in a set number of rows from the input, performs post-processing and validation, and returns a data frame.

Parameters:

block_start (int (optional)) -- The 0-indexed row number from which to start reading the data. For example, in a CSV file, block_start=2 would skip the first two lines after the header and return data starting on row=2. Default=0
block_size (int (optional)) -- The number of rows to read in. Use block_size=None to read in all available data. Default=None
**kwargs (dictionary, optional) -- Extra arguments

Returns:

res_df -- dataframe of the object data.

Return type:

Pandas dataframe
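The block_start and block_size semantics described above can be mimicked for a CSV source with plain pandas. This is a minimal sketch under stated assumptions: `read_csv_block` and the sample text are hypothetical, and the actual CSV reader in sorcha may be implemented differently.

```python
import io

import pandas as pd

csv_text = "ObjID,H\nobj0,17.0\nobj1,18.5\nobj2,20.1\nobj3,21.3\n"


def read_csv_block(text, block_start=0, block_size=None):
    # Line 0 is the header; skip the block_start data lines that follow it.
    skip = list(range(1, block_start + 1))
    # nrows=None reads every remaining row.
    return pd.read_csv(io.StringIO(text), skiprows=skip, nrows=block_size)


block = read_csv_block(csv_text, block_start=2, block_size=1)
rest = read_csv_block(csv_text, block_start=2)
```

Here `block` holds only the row for obj2, while `rest` holds obj2 and obj3 because block_size=None reads to the end of the input.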

abstractmethod _read_rows_internal(block_start=0, block_size=None, **kwargs)[source]: Function to do the actual source-specific reading.

read_objects(obj_ids, **kwargs)[source]

Read in a chunk of data corresponding to all rows for a given set of object IDs.

Parameters:

obj_ids (list) -- A list of object IDs to use.
**kwargs (dictionary, optional) -- Extra arguments

Returns:

res_df -- The dataframe for the object data.

Return type:

Pandas dataframe

abstractmethod _read_objects_internal(obj_ids, **kwargs)[source]: Function to do the actual source-specific reading.

_validate_object_id_column(input_table)[source]

Checks that the object ID column exists and converts it to a string. This is the common validity check for all object data tables.

Parameters: input_table (Pandas dataframe) -- A loaded table.
Returns: input_table -- Returns the input dataframe modified in-place.
Return type: Pandas dataframe
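A check of this kind might look like the sketch below. The function is a hypothetical stand-in. In particular, raising ValueError on a missing column is an assumption, and the real method may report the error differently.

```python
import pandas as pd


def validate_object_id_column(input_table):
    """Hypothetical stand-in for the check described above."""
    if "ObjID" not in input_table.columns:
        # Assumption: the error type and message are illustrative only.
        raise ValueError("Input table is missing the required ObjID column.")
    # Cast IDs to strings so that numeric and text IDs compare consistently.
    input_table["ObjID"] = input_table["ObjID"].astype(str)
    return input_table


table = pd.DataFrame({"ObjID": [101, 102], "H": [17.0, 18.5]})
validated = validate_object_id_column(table)
```

The returned object is the same data frame that was passed in, with its ObjID values converted to strings.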

_process_and_validate_input_table(input_table, **kwargs)[source]

Perform any input-specific processing and validation on the input table. Modifies the input dataframe in place.

Parameters:

input_table (Pandas dataframe) -- A loaded table.
**kwargs (dictionary, optional) -- Extra arguments

Returns:

input_table -- Returns the input dataframe modified in-place.

Return type:

Pandas dataframe

Notes

The base implementation includes filtering that is common to most input types. Subclasses should call super()._process_and_validate_input_table() to ensure that the ancestor’s validation is also applied.

Additional arguments to use:

disallow_nan (boolean) -- If True, checks the data for NaNs or nulls.
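The pattern of extending the validation in a subclass can be sketched as follows. `BaseSketch` is a simplified stand-in for the base class and `StrippedNamesReader` is a hypothetical subclass. The ValueError and the whitespace-stripping step are assumptions made for illustration, not sorcha's behaviour.

```python
import pandas as pd


class BaseSketch:
    """Simplified stand-in for the base class (an assumption, not sorcha's code)."""

    def _process_and_validate_input_table(self, input_table, **kwargs):
        # Optional NaN/null check, switched on through an extra keyword argument.
        if kwargs.get("disallow_nan", False) and input_table.isnull().values.any():
            raise ValueError("Input table contains NaN or null values.")
        return input_table


class StrippedNamesReader(BaseSketch):
    """Hypothetical subclass that adds its own processing step."""

    def _process_and_validate_input_table(self, input_table, **kwargs):
        # Run the ancestor's checks first, then apply source-specific clean-up.
        input_table = super()._process_and_validate_input_table(input_table, **kwargs)
        input_table["ObjID"] = input_table["ObjID"].str.strip()
        return input_table


reader = StrippedNamesReader()
clean = reader._process_and_validate_input_table(
    pd.DataFrame({"ObjID": [" a ", "b"], "H": [1.0, 2.0]}), disallow_nan=True
)

try:
    reader._process_and_validate_input_table(
        pd.DataFrame({"ObjID": ["a"], "H": [float("nan")]}), disallow_nan=True
    )
    nan_rejected = False
except ValueError:
    nan_rejected = True
```

Calling the ancestor first means the shared checks, including the optional NaN check, run before any subclass-specific processing.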