Learning over time: Data collection

Áine
6 min read · Jun 21, 2022

In late 2019, my colleagues and I wrapped up lab-based data collection on a longitudinal project that involved over 900 EEG sessions with infants. The babies were each seen in the lab eight times before their first birthdays. The majority of these lab sessions also involved motion capture and eye-tracking. EEG data collection had taken somewhere between two and a half and three years. By the time I joined in summer 2018, data collection was in full swing with two to three EEG sessions per day. We are still working through data analysis on this huge sample, and many of the babies we studied are probably school-aged by now.

It was a massive undertaking and I’m sure that, as a large and frequently-changing team, many different people have different views on what worked and what didn’t. I wanted to record the things we did that I found useful and practical, written about in a personal capacity (i.e. not an official project one). I wasn’t around for the experimental design and paradigm planning stages so I’ll start with data collection.

It’s a marathon, not a sprint: Everyone has different tolerances for long work hours and juggling days in the lab with evenings at a desk. However, long working hours usually come with the promise of a change of pace after an experiment wraps up, a paper gets submitted, term ends, and so on. If you’re working with babies who you have to see in the lab multiple times within a fixed period of time, the typical ebbs and flows of uptime and downtime in a research project don’t apply. Once you put your foot to the pedal it won’t come off, because babies will not stop growing up even if you ask politely. My advice here is not to push yourself to the edge of your tolerance for working after a long testing day — you’ll probably end up on that brink anyway.

We had a policy of scheduling days off to make up for weekend work. If someone was testing on a Saturday, they turned it into a full work day and took a full day off during the week. Later, we proposed that days off were scheduled so people would typically get a two-day “weekend”. I have no issues dipping in and out of work at the weekends now (a bad habit, I know), but testing infants on multiple paradigms multiple times a day for a full week was so much more energy-intensive and fatiguing than writing papers or running analyses.

Be patient about papers: Relatedly, trying to be “productive” in terms of publishing, on top of spending most of the week in the lab, turned out to be impossible. We had so much to do, including getting tasks ready for new data collection timepoints as the infants grew into toddlers, dealing with technical and other issues arising, backing up, managing and auditing data, trying (failing) to stay on top of the literature, creating preprocessing pipelines, visualising data, running analyses…

The volume of work was immense, and huge amounts of testing mean huge amounts of data to deal with. Then, on top, add the noisy, messy nature of infant data, which means files not moving easily through pipelines. Not just because of EEG artefacts, but because of odd things happening during sessions. Videos being paused without pausing the EEG recording; EEG recording paused without pausing eye-tracking recording; triggers disappearing and reappearing in the wrong places, or with the wrong labels; all sorts of things that will throw up an error in your finely tuned code. Detective work takes time, as does adding in the extra if-statements needed to deal with unusual issues. I’m amazed and proud that we ever managed to get enough processing and analysis done to present at conferences before 2020.
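To make the "extra if-statements" idea concrete, here is a minimal sketch of the kind of sanity checks a session might pass through before entering a pipeline. Everything here is invented for illustration — the session dict layout, the trigger count, and the duration tolerance are assumptions, not the project's actual code.

```python
# Hypothetical pre-pipeline sanity checks for one testing session.
# All names, thresholds, and data layouts are made up for illustration.

EXPECTED_TRIGGERS_PER_BLOCK = 50  # assumed paradigm parameter

def validate_session(session):
    """Return a list of human-readable warnings for one testing session."""
    warnings = []

    # Recordings started or stopped out of sync leave mismatched durations.
    eeg_dur = session.get("eeg_duration_s")
    et_dur = session.get("eyetracking_duration_s")
    if eeg_dur is not None and et_dur is not None:
        if abs(eeg_dur - et_dur) > 5.0:  # tolerance in seconds (assumed)
            warnings.append(
                f"EEG ({eeg_dur}s) and eye-tracking ({et_dur}s) durations differ"
            )

    # Triggers can disappear, duplicate, or land in the wrong places.
    for block, triggers in session.get("triggers", {}).items():
        if len(triggers) != EXPECTED_TRIGGERS_PER_BLOCK:
            warnings.append(
                f"block {block}: expected {EXPECTED_TRIGGERS_PER_BLOCK} "
                f"triggers, found {len(triggers)}"
            )
        # Timestamps should be strictly increasing; out-of-order ones
        # suggest a paused-and-resumed recording.
        if any(b <= a for a, b in zip(triggers, triggers[1:])):
            warnings.append(f"block {block}: trigger timestamps not increasing")

    return warnings
```

Checks like these flag the odd sessions for detective work instead of letting them crash (or, worse, silently pass through) the analysis code.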

I don’t believe we submitted any papers until 2020. That’s a multi-year lag from the project start date, with lab testing having totally finished by the time we submitted the first manuscript. This probably sounds like anathema to other postdocs, but without the manpower to hand testing over completely, juggling testing and publishing was impossible. We rely on a trickle of papers from past jobs to appear “employable” to people who care about this sort of thing. Do I think this lag to publishing will have some sort of negative impact on our careers? No, because it allowed us to be thorough, cautious, and informed, and to write and publish things we stand over.

Sharing the load: This one is probably a bit polarising, as I know a lot of people feel that postdocs should focus on analysis and writing, and RAs should do the testing. In our project, RAs did the behavioural data collection, but postdocs and RAs shared the load of lab testing. I think this was fruitful, because it allowed postdocs with more experience of baby testing and of baby EEG, eye-tracking and mocap to keep an eye on data quality (though I was useless with mocap personally), and deal with issues arising quickly. I also like doing (some) testing because I want a feel for the paradigm and the data quality. For example when I showed up on the project I noticed immediately that some caps were getting increasingly damaged from grabby little baby hands and heavy use over time. Some day I’ll be that annoying PI who comes into the lab, pretends she knows about the new cutting-edge techniques (or dismisses them offhand), breaks stuff unknowingly, and doesn’t take the hint to leave, but for this project shared testing was a good thing.

The RAs became very experienced and capable with these techniques, and we maybe could have made them bear the burden of more lab testing as time went on, but I think that the result would have been massive RA burn-out and a lot more staff turnover. Testing is hard, and I find it much more draining than sitting at a desk and writing code. On a long-term project that requires input from multiple people, processes that foster cohesion, mutual respect, and trust are really valuable.

Families and participants are a precious resource: Part of the load in any project like this is also shared with the families. When I started on the project I was totally amazed by the number of participating families, and the retention. These parents came back month after month to have their babies listen to weird repetitive sounds after we had wrestled a damp cap full of sensors and wires on the little ones’ heads! I had assumed retention in a longitudinal project would be difficult, but I think we showed the families how much we appreciated them — through patience, flexibility, and friendliness — and they did so much for us in return.

Keep on top of data management: One of the things that we did well, but would have been so much easier if we’d had time to start it earlier, was data management. We had a protocol for backing up data after testing sessions, and carved out some time for an EEG data audit in 2018–19, but even now we’re discovering missing data without explanation, corrupted files, misnamed files, and so on. An enterprising RA suggested we use REDCap to keep track of collected and coded data, which was a game-changer. I suggested we make standardised lab log-sheets, instead of the more open-entry type we were using, which could go into REDCap and be searchable for artefacts, infant behaviour, “red flags”, and anything else informative during preprocessing and analysis. This sort of forward thinking was helpful, especially when there were datasets we didn’t analyse until years after they were collected. I think we did a good job with the resources we had, but if I were to run a project like this myself I would definitely devise an RA role with a substantial proportion of their time dedicated specifically to data management.
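An audit for missing or misnamed files can be as simple as comparing what should exist against what is on disk. The sketch below assumes a made-up naming convention of `<participant>_<timepoint>_<modality>.ext` under one folder per participant; the real project's layout, modalities, and file extensions are not described here, so treat every name as hypothetical.

```python
# Minimal data-audit sketch: list expected-but-missing session files.
# The folder layout, naming convention, and extensions are all assumed.
from pathlib import Path

MODALITIES = {"eeg": ".bdf", "eyetracking": ".tsv", "log": ".csv"}  # assumed

def audit_sessions(root, participants, timepoints):
    """Return a sorted list of expected files missing under root."""
    root = Path(root)
    missing = []
    for pid in participants:
        for tp in timepoints:
            for modality, ext in MODALITIES.items():
                expected = root / pid / f"{pid}_{tp}_{modality}{ext}"
                if not expected.exists():
                    missing.append(expected.relative_to(root).as_posix())
    return sorted(missing)
```

Run regularly, something like this surfaces gaps while the testing session is still fresh enough in someone's memory to explain them, rather than years later.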

Those are the things that occur to me when I think about the data collection lessons I might take from this very special and ambitious project, and bring with me into my future career. It has been such a privilege to work on this project and to learn so much from my PI and especially my admin, postdoc and RA colleagues. Perhaps I’ll write another piece about data processing and analysis, especially facing into my own new project in the next few months.
