Mini-difi vs. brute-force linking tests

This demonstration takes the output from midway through a Sorcha run with the demoset of 100k TNOs before linking is applied.

[ ]:
import pandas as pd
import numpy as np

from astropy.coordinates import SkyCoord
import astropy.units as u

Inputs

[2]:
df = pd.read_csv("midSorcha_TNOs.csv").drop(columns=["Unnamed: 0"]).reset_index()
print(f"Number of objects: {len(df['ssObjectId'].unique())}")
print(f"Number of observations: {len(df)}")
Number of objects: 994
Number of observations: 62354

The “brute force” miniDifi implementation.

This should be a (hopefully) more pedagogically easier to follow implementation of the mock linker algorithm.

The results should match the fast (but harder to follow) implementation in Sorcha.

[3]:
#
# Compute if the current night has observations that will qualify as traclkets
#

def compute_nights(df, night_start_utc_days = 17.0 / 24.0):
    # this computes the integer night (== the MJD of the evening the observations started)
    # for the observations in df.

    tshifted = df["midPointTai"] - night_start_utc_days
    return tshifted.astype(int)

def has_tracklet(df, minlen_arcsec=1, maxdt_minutes=90, min_observations=2):
    # this computes whether a given night has /any/ observations close enough to be
    # made into a tracklet.
    #
    # WARNING: the input dataframe must be for _a single_ object.

    if "ssObjectId" in df.columns:
        assert len(df["ssObjectId"].unique()) == 1, "This function must be run on a single object"

    maxdt = maxdt_minutes / 60. / 24.
    minlen = minlen_arcsec / 3600.
    detections = []

    if len(df)< min_observations:
        return False
    if min_observations == 1:  # for edge case where one observation is needed for a tracklet
        return True
    for i in range(len(df)):
        for j in range(i+1, len(df)):
            t0, t1 = df["midPointTai"].iloc[i], df["midPointTai"].iloc[j]
            ra0, ra1 = df["ra"].iloc[i], df["ra"].iloc[j]
            dec0, dec1 = df["decl"].iloc[i], df["decl"].iloc[j]

            # is this pair too spread out in time?
            dt = t1 - t0
            assert dt > 0, (dt, t1, t0)
            if dt > maxdt:
                continue

            coord1 = SkyCoord(ra=ra0*u.degree, dec=dec0*u.degree, frame='icrs')
            coord2 = SkyCoord(ra=ra1*u.degree, dec=dec1*u.degree, frame='icrs')
            angular_distance = coord1.separation(coord2)
            dist = angular_distance.deg

            # is this pair too close in distance?
            if dist < minlen:
                continue

            # this combo will make a tracklet; return true!
            ##print(f'dt = {dt*24*60:.2f} min, sep = {dist*3600:.2f}"')
            detections.append(i)
            detections.append(j)
            if len(set(detections)) >= min_observations:
                return True

    return False

def nights_with_tracklets(df):
    # returns a list (as ndarray) of nights (MJD) with tracklets

    if "ssObjectId" in df.columns:
        assert len(df["ssObjectId"].unique()) == 1, "This function must be run on a single object"

    # Filter the DataFrame to only keep rows where the 'night' value appears 2 or more times
    df_filtered = df[df.groupby('night')['night'].transform('count') >= 1].reset_index()

    # Now, for each night compute if it has a tracklet by calling has_tracklet()
    nightHasTracklet = df_filtered.groupby("night").apply(has_tracklet, include_groups=False)
    nightsWithTracklets = nightHasTracklet[nightHasTracklet]

    return nightsWithTracklets.index.values

def enumerateOpportunities(nights, window=14, nlink=3):
    #
    # Enumerate all discovery opportunities given the list of nights with tracklets.
    #
    # Note: window is inclusive of the most recent night, and exclusive of the -window night.
    #
    # For example, if window=2, that means we'd search for tracklets tonight and yesteday,
    # but not day before yesterday. I.e., if tonight is the MJD of current night, the condition
    # to include a night into the tracklet suite is:
    #
    #     tonight - window < tracklet_night  and  tracklet_night <= tonight
    #
    # (i.e., it's \lt on the first comparison, not \le).

    # we'll be collecting unique opportunities here
    opportunities = set()
    if len(nights) == 0:
        return opportunities

    # go from the first to last night, and for each find the
    # tracklets in the trailing window
    for t1 in range(nights.min(), nights.max()+window+1):
        t0 = t1 - window # trailing window
        tracklets = tuple(nights[(t0 < nights) & (nights <= t1)])

        # not enough tracklets to link? continue...
        if len(tracklets) < nlink:
            continue

        opportunities.add(tracklets)

    return opportunities

def linkObject(df):
    #
    # find and return the list of all discovery opportunities for a given
    # object.
    #
    # WARNING: the df must be for a single object (!!).
    #

    nightsWithTracklets = nights_with_tracklets(df)
    opps = enumerateOpportunities(nightsWithTracklets)
    return opps

def linkObservations(df):
    #
    # The main driver. Given a list of observations of multiple objects,
    # analyze whether each object can be linked and return the number of
    # linking opportunities.
    #

    df["night"] = compute_nights(df)

    res = df.groupby("ssObjectId").apply(
        lambda df: len(linkObject(df)),
        include_groups=False)
    return res

Tests

Run the brute-force linker. It returns a series with the number of discovery opportunities for each object:

[4]:
%%time
bf = linkObservations(df)
bf
CPU times: user 11.4 s, sys: 36.9 ms, total: 11.5 s
Wall time: 11.5 s
[4]:
ssObjectId
STC001TFa     9
STC001TGa     2
STC001THa    17
STC001TIa     2
STC001TJa     2
             ..
STC0029va     4
STC0029wa     7
STC0029xa     1
STC0029ya    14
STC0029za    14
Length: 994, dtype: int64

Now run the Sorcha miniDifi implementation.

Note: if you’re paying attention to runtime, run this cell at least twice as the first time you’re running there will be lots of overhead with module loading and JIT compilation.

[5]:
%%time
import sorcha.modules.PPMiniDifi as md

df_all = pd.read_csv("midSorcha_TNOs.csv").drop(columns=["Unnamed: 0"]).reset_index()

# Convert to ndarray
nameLen = df_all["ssObjectId"].str.len().max()
obsv = np.asarray(
        df_all.to_records(
            index=False,
            column_dtypes=dict(_name=f"a{nameLen}", diaSourceId="u8", midPointTai="f8", ra="f8", decl="f8"),
        )
    )

# run minidifi
minidifi = md.linkObservations(obsv, seed=42, p=1, objectId="ssObjectId", maxdt_minutes=90, minlen_arcsec=1, min_observations=2, window=14, nlink=3, night_start_utc_days=17.0 / 24.0)
CPU times: user 1.1 s, sys: 190 ms, total: 1.29 s
Wall time: 1.47 s

Comparison

If the two implementations match, there should be no assertions and an the dataframe of differences will be empty in the end.

[6]:
bf_res = pd.DataFrame({"bf_discoveryChances": bf})
md_res = pd.DataFrame(minidifi).set_index("ssObjectId")[["discoveryChances"]]

# assert all objects are there in both dataframes
assert np.all(bf_res.index.values == md_res.index.values)
print(f"Total number of objects: {len(bf_res)}")

# join the two dataframes
res = bf_res.join(md_res)
res["equal"] = res["bf_discoveryChances"] == res["discoveryChances"]
print(f"Total number of equal objects: {np.count_nonzero(res['equal'])}")

# find where they differ
print("Differences between miniDifi and brute force (note: empty is good!):")
res[~res["equal"]]
Total number of objects: 994
Total number of equal objects: 994
Differences between miniDifi and brute force (note: empty is good!):
[6]:
bf_discoveryChances discoveryChances equal
ssObjectId