ENH: Enable models for sparsely sampled fMRI series #376

Open · wants to merge 15 commits into base: master
Changes from 9 commits
17 changes: 14 additions & 3 deletions bids/analysis/analysis.py
@@ -388,18 +388,29 @@ def get_design_matrix(self, names=None, format='long', mode='both',

if sampling_rate == 'TR':
trs = {var.run_info[0].tr for var in self.collection.variables.values()}
tas = {var.run_info[0].ta for var in self.collection.variables.values()}
if not trs:
raise ValueError("Repetition time unavailable; specify sampling_rate "
"explicitly")
elif len(trs) > 1:
raise ValueError("Non-unique Repetition times found ({!r}); specify "
"sampling_rate explicitly")
sampling_rate = 1. / trs.pop()
"sampling_rate explicitly".format(trs))
TR = trs.pop()
if not tas:
warnings.warn("Acquisition time unavailable; assuming TA = TR")
tas = {TR}
elif len(tas) > 1:
raise ValueError("Non-unique acquisition times found ({!r})".format(tas))

sampling_rate = 1. / TR
acquisition_time = tas.pop()
elif sampling_rate == 'highest':
sampling_rate = None
dense_df = coll.to_df(names, format='wide',
include_sparse=include_sparse,
sampling_rate=sampling_rate, **kwargs)
sampling_rate=sampling_rate,
integration_window=acquisition_time,
**kwargs)
if dense_df is not None:
dense_df = dense_df.drop(['onset', 'duration'], axis=1)

2 changes: 1 addition & 1 deletion bids/analysis/tests/test_transformations.py
@@ -28,7 +28,7 @@ def sparse_run_variable_with_missing_values():
'duration': [1.2, 1.6, 0.8, 2],
'amplitude': [1, 1, np.nan, 1]
})
run_info = [RunInfo({'subject': '01'}, 20, 2, 'dummy.nii.gz')]
run_info = [RunInfo({'subject': '01'}, 20, 2, 2, 'dummy.nii.gz')]
var = SparseRunVariable('var', data, run_info, 'events')
return BIDSRunVariableCollection([var])

20 changes: 20 additions & 0 deletions bids/analysis/transformations/compute.py
@@ -2,6 +2,7 @@
Transformations that primarily involve numerical computation on variables.
'''

from math import gcd
import numpy as np
import pandas as pd
from bids.utils import listify
@@ -35,6 +36,25 @@ def _transform(self, var, model='spm', derivative=False, dispersion=False,

if isinstance(var, SparseRunVariable):
sr = self.collection.sampling_rate
# Resolve differences in TR and TA to milliseconds
trs = {ri.tr for ri in var.run_info}
tas = {ri.ta for ri in var.run_info}
assert len(trs) == 1
assert len(tas) == 1
TR = int(np.round(1000. * trs.pop()))
TA = int(np.round(1000. * tas.pop()))
SR = int(np.round(1000. / sr))
if TA is None or TA < TR:
# Use a unit that fits a whole number of times into both
# the interscan interval (TR) and the integration window (TA)
dt = gcd(TR, TA)
if dt > SR:
Collaborator: I think I'm probably missing something, but if we have to give up on a value that neatly bins TR and TA, why not just use SR at that point? E.g., if dt = 50 and SR = 10, then this will give us dt = 10, so we're at the SR anyway. If dt = 50 and SR = 17, we get dt = 25, which both prevents neat binning and is bigger than we wanted. Is the idea that at least every (dt // SR)th bin will align nicely this way?

Collaborator (author): 25 also permits neat binning...

Collaborator: Oh, duh. The inversion tripped me up. That is indeed the thing I was missing. :)
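
(Editorial aside: to make the arithmetic in this exchange concrete, here is a minimal standalone sketch of the bin-width selection. All values are in milliseconds; the TR and TA below are made up so that gcd(TR, TA) = 50, matching the numbers discussed in this thread.)

from math import gcd

def pick_bin_width(tr_ms, ta_ms, sr_ms):
    # Pick a bin width that divides both TR and TA, shrinking it toward the
    # target sampling interval sr_ms (same arithmetic as the diff above)
    dt = gcd(tr_ms, ta_ms)
    if dt > sr_ms:
        dt = dt // (dt // sr_ms)
    return dt

print(pick_bin_width(2000, 1050, 10))  # 10 -> hits the target interval exactly
print(pick_bin_width(2000, 1050, 17))  # 25 -> larger than 17, but still divides TR and TA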

Collaborator: That said, I'm still not sure we shouldn't just use the SR. Since the user could have set the sr on the collection manually, the compromise solution here risks explicitly ignoring their intentions. This isn't an entirely benign decision, because in subsequent transformations, if a mismatch in sampling rates is detected, that will trigger a resampling in at least one variable, when potentially none might have been needed. They could also have code that depends on knowing what the SR is.

I think maybe we need to come up with a general policy about this, because it crops up in other places too (e.g., the issue about what to do if event files contain very short durations). I.e., the general question is: in a situation where the user appears to be picking sampling rates that are objectively inefficient or result in loss of information, should we be paternalistic and fix it for them, or just let them know they can probably do better?

Collaborator: One thing we could also do, though it's a bit annoying, is internally maintain a flag that indicates whether or not the user has explicitly overridden the default sampling rate. If SR=10 purely because that's the default, overriding it seems fine. If the user explicitly set it themselves, maybe we don't want to pick a different value for them.

Collaborator (author) @effigies, Feb 11, 2019: It feels like we're trying to thread a needle where we make smart default choices based on our expertise on the one hand, and on the other, load a gun, cock it and point it at the foot of anyone who looks like they might know what they're doing...

Currently we reserve the footgun for people who use ToDense to get themselves a sampling interval that is relatively prime to gcd(TR, TA). If someone knows enough to manipulate the default sampling rate, they can learn that Convolve with a sparse acquisition paradigm will choose a slightly different interval than with a continuous acquisition paradigm unless the variable is already dense.

But perhaps there's a better general approach here? I admit that I chose this simply because building a boxcar function and taking a mean is easier than learning how to do this with interp1d.

Collaborator: The general issue potentially affects anyone who sets the sampling_rate explicitly in BIDSLayout.get_collections or load_variables. A sophisticated user could certainly know enough about their dataset to think "I'll use a sampling rate of 50, because that's high enough to account for my short-duration events, but still manageable computationally". Thereafter, I think that user could reasonably expect that if they call ToDense or Convolve without explicitly specifying a different sampling rate, the resulting variable will have an effective sampling rate of 50.

I think making decisions for the user internally is fine if there's an explicit argument in the transformation that implies as much. I.e., if ToDense has default sampling_rate='auto', then the docstring can just explain that we will act intelligently unless a numeric value is passed, and the result may not equal what's currently set in the collection. That seems fine. The issue is that in this case, there's no provision to specify the sampling rate in Convolve. I was against doing that because I don't think the spec should have that argument, but I guess we can make it a pybids-only argument, if only for the sake of making it clear to a sophisticated user what's going on.
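
(Editorial aside: a rough sketch of what that proposed pybids-only escape hatch might dispatch on. This is purely hypothetical — the function name and arguments are invented for illustration and are not part of this PR or the spec.)

def resolve_convolve_rate(sampling_rate, collection_rate, gcd_based_rate):
    # Hypothetical dispatch: 'auto' lets pybids pick the gcd-derived rate for
    # sparse acquisitions, while an explicit numeric value from the user wins.
    if sampling_rate == 'auto':
        return gcd_based_rate if gcd_based_rate is not None else collection_rate
    return float(sampling_rate)

print(resolve_convolve_rate('auto', 10, 20))  # 20: pybids' choice for a sparse run
print(resolve_convolve_rate(50, 10, 20))      # 50.0: the user's explicit rate is honoured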

# Find the nearest whole factor of dt >= SR
# This compromises between the target sampling rate and
# one that neatly bins TR and TA
dt = dt // (dt // SR)
sr = 1000. / dt

var = var.to_dense(sr)

df = var.to_df(entities=False)
8 changes: 5 additions & 3 deletions bids/variables/entities.py
@@ -36,23 +36,25 @@ class RunNode(Node):
image_file (str): The full path to the corresponding nifti image.
duration (float): Duration of the run, in seconds.
repetition_time (float): TR for the run.
acquisition_time (float): TA for the run.
task (str): The task name for this run.
'''

def __init__(self, entities, image_file, duration, repetition_time):
def __init__(self, entities, image_file, duration, repetition_time, acquisition_time=None):
self.image_file = image_file
self.duration = duration
self.repetition_time = repetition_time
self.acquisition_time = acquisition_time or repetition_time
super(RunNode, self).__init__('run', entities)

def get_info(self):

return RunInfo(self.entities, self.duration, self.repetition_time,
self.image_file)
self.acquisition_time, self.image_file)


# Stores key information for each Run.
RunInfo = namedtuple('RunInfo', ['entities', 'duration', 'tr', 'image'])
RunInfo = namedtuple('RunInfo', ['entities', 'duration', 'tr', 'ta', 'image'])
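
(Editorial aside: with the new field, RunInfo now takes ta positionally between tr and image. A minimal sketch, assuming the import path shown in this diff and using arbitrary values in the style of the updated test fixtures:)

from bids.variables.entities import RunInfo

# entities, duration (s), tr (s), ta (s), image
info = RunInfo(entities={'subject': '01'}, duration=480, tr=2, ta=2,
               image='dummy.nii.gz')
print(info.tr, info.ta)  # 2 2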


class NodeIndex(Node):
29 changes: 26 additions & 3 deletions bids/variables/io.py
@@ -82,6 +82,27 @@ def load_variables(layout, types=None, levels=None, skip_empty=True,
return dataset


def _get_timing_info(img_md):
if 'RepetitionTime' in img_md:
tr = img_md['RepetitionTime']
if 'DelayTime' in img_md:
ta = tr - img_md['DelayTime']
elif 'SliceTiming' in img_md:
slicetimes = sorted(img_md['SliceTiming'])
# a, b ... z
# z = final slice onset, b - a = slice duration
ta = np.round(slicetimes[-1] + slicetimes[1] - slicetimes[0], 3)
# If the "silence" is <1ms, consider acquisition continuous
if np.abs(tr - ta) < 1e-3:
ta = tr
else:
ta = tr
elif 'VolumeTiming' in img_md:
return NotImplemented

return tr, ta
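
(Editorial aside: to illustrate how the helper above derives TA, here is a standalone sketch with made-up sidecar metadata. It reimplements only the RepetitionTime branch; the real function receives the image's metadata from the layout.)

import numpy as np

def get_timing_info(img_md):
    # Mirrors the RepetitionTime branch of _get_timing_info above
    tr = img_md['RepetitionTime']
    if 'DelayTime' in img_md:
        ta = tr - img_md['DelayTime']
    elif 'SliceTiming' in img_md:
        slicetimes = sorted(img_md['SliceTiming'])
        # last slice onset + one slice duration = acquisition time
        ta = float(np.round(slicetimes[-1] + slicetimes[1] - slicetimes[0], 3))
        if np.abs(tr - ta) < 1e-3:
            ta = tr
    else:
        ta = tr
    return tr, ta

# Sparse acquisition: TR = 2.5 s, all slices collected in the first second
print(get_timing_info({'RepetitionTime': 2.5,
                       'SliceTiming': [0.0, 0.25, 0.5, 0.75]}))  # TR 2.5 s, TA 1.0 s

# Equivalent dataset that declares DelayTime explicitly
print(get_timing_info({'RepetitionTime': 2.5, 'DelayTime': 1.5}))  # TR 2.5 s, TA 1.0 s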


def _load_time_variables(layout, dataset=None, columns=None, scan_length=None,
drop_na=True, events=True, physio=True, stim=True,
regressors=True, skip_empty=True, **selectors):
@@ -141,8 +162,9 @@ def _load_time_variables(layout, dataset=None, columns=None, scan_length=None,
if 'run' in entities:
entities['run'] = int(entities['run'])

tr = layout.get_metadata(img_f, suffix='bold', domains=domains,
full_search=True)['RepetitionTime']
img_md = layout.get_metadata(img_f, suffix='bold', domains=domains,
full_search=True)
tr, ta = _get_timing_info(img_md)

# Get duration of run: first try to get it directly from the image
# header; if that fails, try to get NumberOfVolumes from the
@@ -162,7 +184,8 @@ def _load_time_variables(layout, dataset=None, columns=None, scan_length=None,
raise ValueError(msg)

run = dataset.get_or_create_node('run', entities, image_file=img_f,
duration=duration, repetition_time=tr)
duration=duration, repetition_time=tr,
acquisition_time=ta)
run_info = run.get_info()

# Process event files
13 changes: 8 additions & 5 deletions bids/variables/kollekshuns.py
@@ -243,7 +243,7 @@ def _all_dense(self):
for v in self.variables.values()])

def resample(self, sampling_rate=None, variables=None, force_dense=False,
in_place=False, kind='linear'):
in_place=False, kind='linear', **kwargs):
''' Resample all dense variables (and optionally, sparse ones) to the
specified sampling rate.

@@ -276,7 +276,8 @@ def resample(self, sampling_rate=None, variables=None, force_dense=False,
# None if in_place; no update needed
_var = var.resample(sampling_rate,
inplace=in_place,
kind=kind)
kind=kind,
**kwargs)
if not in_place:
_variables[name] = _var

@@ -289,6 +290,7 @@

def to_df(self, variables=None, format='wide', sparse=True,
sampling_rate=None, include_sparse=True, include_dense=True,
integration_window=None,
**kwargs):
''' Merge columns into a single pandas DataFrame.

@@ -343,9 +345,10 @@
sampling_rate = sampling_rate or self.sampling_rate

# Make sure all variables have the same sampling rate
variables = list(self.resample(sampling_rate, variables,
force_dense=True,
in_place=False).values())
variables = list(
self.resample(sampling_rate, variables,
force_dense=True, in_place=False,
integration_window=integration_window).values())

return super(BIDSRunVariableCollection, self).to_df(variables, format,
**kwargs)
2 changes: 1 addition & 1 deletion bids/variables/tests/test_variables.py
@@ -19,7 +19,7 @@ def generate_DEV(name='test', sr=20, duration=480):
ent_names = ['task', 'run', 'session', 'subject']
entities = {e: uuid.uuid4().hex for e in ent_names}
image = uuid.uuid4().hex + '.nii.gz'
run_info = RunInfo(entities, duration, 2, image)
run_info = RunInfo(entities, duration, 2, 2, image)
return DenseRunVariable('test', values, run_info, 'dummy', sr)


41 changes: 30 additions & 11 deletions bids/variables/variables.py
@@ -253,8 +253,8 @@ def split(self, grouper):
subsets = []
for i, (name, g) in enumerate(data.groupby(grouper)):
name = '%s.%s' % (self.name, name)
args = [name, g, getattr(self, 'run_info', None), self.source]
col = self.__class__(*args)
col = self.__class__(name=name, data=g, source=self.source,
run_info=getattr(self, 'run_info', None))
subsets.append(col)
return subsets

@@ -419,7 +419,7 @@ def _build_entity_index(self, run_info, sampling_rate):
self.timestamps = pd.concat(_timestamps, axis=0, sort=True)
return pd.concat(index, axis=0, sort=True).reset_index(drop=True)

def resample(self, sampling_rate, inplace=False, kind='linear'):
def resample(self, sampling_rate, integration_window=None, inplace=False, kind='linear'):
'''Resample the Variable to the specified sampling rate.

Parameters
@@ -436,7 +436,7 @@ def resample(self, sampling_rate, inplace=False, kind='linear'):
'''
if not inplace:
var = self.clone()
var.resample(sampling_rate, True, kind)
var.resample(sampling_rate, integration_window, True, kind)
return var

if sampling_rate == self.sampling_rate:
@@ -447,18 +447,36 @@

self.index = self._build_entity_index(self.run_info, sampling_rate)

x = np.arange(n)
num = len(self.index)

from scipy.interpolate import interp1d
f = interp1d(x, self.values.values.ravel(), kind=kind)
x_new = np.linspace(0, n - 1, num=num)
self.values = pd.DataFrame(f(x_new))
if integration_window is not None:
from scipy.sparse import lil_matrix
old_times = np.arange(n) / old_sr
new_times = np.arange(num) / sampling_rate
integrator = lil_matrix((num, n), dtype=np.uint8)
count = None
for i, new_time in enumerate(new_times):
cols = (old_times >= new_time) & (old_times < new_time + integration_window)
# This could be determined analytically, but dodging off-by-one errors
if count is None:
count = np.sum(cols)
integrator[i, cols] = 1

old_vals = self.values.values
self.values = pd.DataFrame(integrator.tocsr().dot(old_vals) / count)
else:
from scipy.interpolate import interp1d
x = np.arange(n)
f = interp1d(x, self.values.values.ravel(), kind=kind)
Contributor: This is OK for upsampling, but not for downsampling; depending on frequency content, this will introduce aliasing.

Collaborator: I'm not sure avoiding aliasing is in scope... or at least, I'm not sure we want to implicitly start low-pass filtering the input signal without the user's explicit consent. I'd be okay adding an explicit TemporalFilter transformation (actually I think this is already in the spec and just not yet implemented), and the user can then call that themselves before the resampling step if they like. But doing some magic internally to pick a suitable filter or otherwise avoid aliasing means (a) reproducibility across packages is limited, and (b) we're taking on the responsibility for producing sensible results, and to my mind this really falls on the user.
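
(Editorial aside: a standalone sketch of the aliasing concern and the filter-then-decimate alternative mentioned above. This is illustrative signal-processing code with made-up values, not pybids API and not part of this PR.)

import numpy as np
from scipy.signal import decimate

old_sr, new_sr = 100, 10                  # Hz
t = np.arange(0, 10, 1 / old_sr)
# 1 Hz signal of interest plus a 31 Hz component that aliases to ~1 Hz
# once the series is resampled to 10 Hz
x = np.sin(2 * np.pi * 1 * t) + 0.5 * np.sin(2 * np.pi * 31 * t)

naive = x[::old_sr // new_sr]             # plain decimation: 31 Hz energy folds into the passband
filtered = decimate(x, old_sr // new_sr)  # low-pass (anti-alias) filter, then decimate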

x_new = np.linspace(0, n - 1, num=num)
self.values = pd.DataFrame(f(x_new))

assert len(self.values) == len(self.index)

self.sampling_rate = sampling_rate
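
(Editorial aside: a self-contained illustration of the integration-window branch above — the boxcar-averaging approach mentioned in the review discussion — using made-up sampling rates and random values rather than a real DenseRunVariable.)

import numpy as np
from scipy.sparse import lil_matrix

old_sr, new_sr = 100., 10.          # Hz
integration_window = 0.05           # seconds; here, 5 old samples per bin
n = 1000                            # old sample count (10 s of signal)
num = int(n * new_sr / old_sr)      # new sample count

old_times = np.arange(n) / old_sr
new_times = np.arange(num) / new_sr
values = np.random.rand(n, 1)

# Row i of the integrator selects the old samples falling within
# [new_time, new_time + integration_window); dividing the dot product by the
# (constant) count turns it into a boxcar average.
integrator = lil_matrix((num, n), dtype=np.uint8)
count = None
for i, new_time in enumerate(new_times):
    cols = (old_times >= new_time) & (old_times < new_time + integration_window)
    if count is None:
        count = np.sum(cols)
    integrator[i, cols] = 1

downsampled = integrator.tocsr().dot(values) / count
print(downsampled.shape)  # (100, 1)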

def to_df(self, condition=True, entities=True, timing=True, sampling_rate=None):
def to_df(self, condition=True, entities=True, timing=True, sampling_rate=None,
integration_window=None):
'''Convert to a DataFrame, with columns for name and entities.

Parameters
Expand All @@ -474,7 +492,8 @@ def to_df(self, condition=True, entities=True, timing=True, sampling_rate=None):
sampled uniformly). If False, omits them.
'''
if sampling_rate not in (None, self.sampling_rate):
return self.resample(sampling_rate).to_df(condition, entities)
return self.resample(sampling_rate,
integration_window=integration_window).to_df(condition, entities)

df = super(DenseRunVariable, self).to_df(condition, entities)
