Collecting and handling data are stages of research that shape not only the quality of a study but also its validity and publication potential. Even well-executed data collection can lead to rejection at some journals if the procedures are not described in detail.

Handling data, both technically and scientifically, is equally important. Best practices show that data collection should be planned in advance, with its characteristics predefined; data should never be collected chaotically, without rules.

Primary data collection

Data collection can involve primary data, that is, data that has not been collected before. This is called primary data collection.

Primary data collection is highly dependent on the tools and methods we use to collect data. In all research, data accuracy is essential: you must know the level of accuracy and validity required for data to be included. The tools used to collect data can be instruments like: 

  • Thermometers
  • Microscopes
  • Biochemical instruments
  • Stadiometers
  • Calipers
  • Basic rulers

Standardization

One of the most important principles in research is standardization. A significant part of standardization concerns the instruments and their relation to data collection procedures. Standardization is a process that ensures certain standards are applied when collecting data. These standards most often rest on three principles: (A) making sure the instrument or tool is accurate, (B) making sure the data is valid, and (C) making sure the data collection process is reproducible and comparable to other data sources. In every data collection procedure, one of the first things to consider is whether the data collection tool or instrument is standardized and its measurements accepted within the academic and research communities.
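As a simple illustration of principle (A), a researcher might verify an instrument's accuracy by measuring a certified reference standard and checking the error against an acceptable tolerance. The following is a minimal sketch; the reference value, tolerance, and readings are all hypothetical.

```python
# Minimal sketch of an instrument accuracy check against a certified
# reference standard. All values below are hypothetical.

reference_value = 100.0   # certified value of the reference standard (e.g., 100.0 mm)
tolerance = 0.5           # maximum acceptable absolute error for this instrument

# Repeated readings of the reference standard taken with the instrument
readings = [100.2, 99.8, 100.1, 100.3, 99.9]

mean_reading = sum(readings) / len(readings)
bias = mean_reading - reference_value          # systematic error
relative_error = abs(bias) / reference_value   # error as a fraction of the true value

print(f"Mean reading: {mean_reading:.2f}")
print(f"Bias: {bias:+.2f} (relative error {relative_error:.2%})")
print("Within tolerance" if abs(bias) <= tolerance else "Recalibration needed")
```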

In addition to instruments being standardized, the processes enabling the data collection can be standardized too.

Examples: 

  • Standardized biochemical reagents preparations for biochemical experiments
  • Standardized staining principles in microscopy
  • Standardized ways of collecting fossils in paleontology 

Standardizing these enables reproducibility and credibility of research data collection.

Further data collection can be based on observations in some research areas, especially the life sciences, such as taxon identification by a biologist. Observing taxa in nature and compiling data tables from identified taxa is a frequent procedure in zoology and ecology. To make sure data collection is done properly, a researcher should use identification documents known as taxon identification keys.

Many research areas include observational data. In such cases, it is important to document the principles on which the observations are made; the key is reproducibility.

Primary data sources

Some primary data come from participants' answers; this type of data collection is based on surveys. Survey documents should be clear, aligned with the goal of the study, and able to discriminate well between possible answers. 

Participant availability can be a concern: sometimes participants are unwilling to provide information, and the resulting gaps can bias the data towards those willing to take part in the survey. Sampling bias must be avoided too. Surveys are the most widely used tool in the social sciences, psychology, economics, and many other areas. A simple check for such bias is sketched below.
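One informal way to spot sampling or non-response bias is to compare the composition of the respondents with the known composition of the target population. The sketch below does this for a single demographic variable; the categories, counts, and population shares are hypothetical.

```python
# Minimal sketch: compare respondent composition to the known population
# composition for one demographic variable. All figures are hypothetical.

population_share = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}  # census-style reference
respondents = {"18-34": 52, "35-54": 61, "55+": 37}             # survey counts

total = sum(respondents.values())
for group, count in respondents.items():
    sample_share = count / total
    gap = sample_share - population_share[group]
    print(f"{group}: sample {sample_share:.1%} vs population "
          f"{population_share[group]:.1%} (gap {gap:+.1%})")

# Large gaps suggest the sample over- or under-represents a group,
# a warning sign for sampling or non-response bias.
```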

Secondary data

Secondary data collection is the gathering of data that was collected and stored previously and is reused by later researchers. Most often this data is in digital form, stored as digital documents and accessed through computational tools and platforms.

What is the rule of thumb for assessing secondary data? Simple: look for all the aspects of the data you would also need if you were collecting it yourself. Gather all available information about the instruments, methods of collection, data quality, survey questions, principles behind the collected data, and standardization status.

This information about secondary data is generally stored in a companion record called metadata. When working with secondary data, the first step is to retrieve the metadata and check that it is complete and contains all the relevant information about the primary data collection. 

It is generally a good idea to check whether the research questions are well supported by the metadata. One of the most frequent mistakes when working with previously collected primary data is jumping straight to the dataset relevant to the analysis instead of examining the metadata first.

Consider, as an example, a gene expression dataset retrieved from NCBI (the National Center for Biotechnology Information) for a mouse obesity experiment. The metadata supplies the context for the terms in the dataset.

Metadata contains all the relevant information about the experiment type, experimental design, methods applied, protocols, type of data collected, and the terms in the dataset; in this case, the animals used in the experiment and the interventions applied. Drawing conclusions in research depends heavily on this kind of information.
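As a sketch of how such metadata can be inspected programmatically, the third-party GEOparse package can download a GEO series and expose its metadata as Python dictionaries. The accession number below is only a placeholder, not the specific dataset discussed above.

```python
# Minimal sketch using the third-party GEOparse package to inspect the
# metadata of an NCBI GEO series. "GSE2508" is only a placeholder
# accession; substitute the series you are actually assessing.
import GEOparse

gse = GEOparse.get_GEO(geo="GSE2508", destdir="./geo_cache")

# Series-level metadata: experiment title, design, platform, protocols...
for key, values in gse.metadata.items():
    print(key, ":", "; ".join(values))

# Sample-level metadata: one entry per animal/sample, including
# characteristics such as the intervention applied
for name, gsm in gse.gsms.items():
    print(name, gsm.metadata.get("characteristics_ch1"))
```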

We can now draw another conclusion. Going back to primary data: you should document all the data collection procedures, instruments, standards, experimental designs, and other details in a metadata document. This enables future researchers to examine the data with transparency and reuse the data from the primary authors.
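A minimal way to start such a document is a structured file, for example JSON, written alongside the dataset. The field names below are only an illustrative subset; formal metadata standards in your field may require more.

```python
# Minimal sketch: write a metadata document alongside a primary dataset.
# The field names and values are illustrative, not a formal standard.
import json

metadata = {
    "study_title": "Example height measurement study",
    "collection_period": "2024-01-15 to 2024-03-30",
    "instruments": [
        {"name": "stadiometer", "model": "placeholder-model", "last_calibrated": "2024-01-10"}
    ],
    "collection_protocol": "Standing height measured twice per participant; mean recorded.",
    "units": {"height": "cm"},
    "inclusion_criteria": "Adults aged 18-65 with signed informed consent",
    "missing_data_convention": "NA",
}

with open("study_metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
```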

One of the most important aspects of collecting data is capturing complete and valid metadata: the part of the data that provides the context around the observations, measurements, or survey answers. Metadata can influence the interpretation of the data dramatically, so it is essential to capture it. 

Another reason metadata matters is the study's inclusion criteria. Every good research project has well-defined criteria determining which data points can be included. Most often the metadata helps answer questions about the inclusion criteria and facilitates the data collection process.

Missing data

Data tables or spreadsheets often contain missing observations, appearing either as blank cells or marked as NA (not applicable). In both cases this means the data could not be collected for some reason: a patient not being available, a survey question left unanswered, or any other situation in which a data point could not be recorded.

Should these missing data points be filled in? The answer varies. Data professionals sometimes impute missing data (create artificial values) using certain algorithms, but this should be done only in certain areas; for example, imputation is sometimes used in technical computer simulations. In biomedical studies, by contrast, it is important to use only real-world data and maintain maximal data integrity, so missing data is often not imputed and is simply treated as missing (data points left blank). Depending on the size of the dataset, rows containing missing values can be excluded, leaving only complete rows. With a smaller sample, however, this is not a good idea, as it might reduce the sample to the point where it no longer contains enough data to answer the research questions. A sketch of both approaches follows.
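Here is a minimal sketch of the two approaches using pandas: complete-case analysis (dropping rows with missing values) when the sample is large enough, and a simple mean imputation shown only to illustrate what imputation looks like. The column names and values are hypothetical.

```python
# Minimal sketch of two ways to handle missing values with pandas.
# Column names and values are hypothetical.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "weight_kg": [81.0, np.nan, 64.5, 70.2],
    "glucose_mmol_l": [5.1, 5.6, np.nan, 4.9],
})

# Option 1: complete-case analysis - keep only rows with no missing values.
# Appropriate when the sample is large enough to absorb the loss.
complete_cases = df.dropna()

# Option 2: simple mean imputation - shown for illustration only;
# often avoided in biomedical studies where data integrity is paramount.
imputed = df.fillna(df.mean(numeric_only=True))

print(f"Original rows: {len(df)}, complete cases: {len(complete_cases)}")
```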

Cleaning data

Another important process in handling data is identifying and removing any invalid parts of the data. This process is closely related to a set of principles called data validation: based on predefined quality standards, inclusion criteria, and reference values for valid data, any invalid data should be removed. Researchers should be cautious here, however. Suspect data points should be rechecked multiple times before being ruled invalid, because one of the biggest mistakes in data handling is removing valid data points. Authors sometimes remove outliers simply because they lie far from the rest of the data; this can discard valid observations and should be avoided. Data should be removed only when it results from measurement errors or inaccuracies, fails the inclusion criteria, or has some other defect that makes it invalid for the research project in question. A validation sketch follows.
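A minimal sketch of rule-based validation with pandas, flagging suspect values for manual review rather than silently deleting them; the variable and its plausible range are hypothetical.

```python
# Minimal sketch of rule-based data validation: flag suspect values for
# review instead of deleting them outright. The reference range is hypothetical.
import pandas as pd

df = pd.DataFrame({"subject": [1, 2, 3, 4],
                   "height_cm": [172.0, 16.8, 181.5, 168.0]})

# Hypothetical plausible range for adult standing height
valid_range = (100.0, 230.0)

df["height_suspect"] = ~df["height_cm"].between(*valid_range)

# Review flagged rows manually before deciding whether to exclude them;
# 16.8 cm here is almost certainly a data-entry error (e.g., a missing digit).
print(df[df["height_suspect"]])
```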

Analyzing the data and presenting it to the research community

Analyzing data to derive insights and answer research questions is a crucial phase of handling data. One of the most important aspects here is choosing the right analytic methods: each analytic method corresponds to a clear set of data types, and vice versa. Another important aspect is being objective towards all parts of the data and not favoring any result. Negative results are also results; having no result is also a result. Authors should never favor analyzing only the data that produces positive results, but should focus equally on all the data included in the study. 

The main data outputs are numbers or characters, so they come in numeric or textual form. To make the results and insights derived from the analysis more intuitive and easier to understand, one of the best ways to present them is through data visualization. Visualizations should relate to the questions asked and the methods applied in the research: the graphics should answer the main research questions and be compatible with the methods used to answer them. For example, if the goal of the project was to analyze frequencies, then frequencies should be analyzed and presented in the visualizations.

Data visualization is one of the best ways to make the mathematical part of the research more intuitive. This means, the data presentation should be:

  • Easy to understand and interpret
  • Consistent - the styles used for visualization should not vary too much
  • Adapted for the audience - Authors should understand which audience will engage with the research and which data visualizations will be understandable to them
  • Simple and clear - Highly complex visualizations can confuse readers, so data visualizations should be simple yet telling the whole story about the data
  • Aligned for comparisons - if comparisons are present, plots should be aligned so readers can compare them directly
  • Following standards - There are standards for academic research and for research in different industries. For example, some academic journals require graphs to follow certain styles adapted to the journal's audience.
  • Having the right kind of visualization - Choosing the right type of visualization is very important. For example, medians are best reported with box plots, while frequencies can be reported using bar charts; the data type often dictates the visualization to use (see the sketch after this list)
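As a sketch of matching chart type to data type, the example below draws a box plot for a continuous variable (where the median is of interest) and a bar chart for category frequencies. All data is made up for illustration.

```python
# Minimal sketch: match the chart type to the data type.
# Box plot for a continuous variable (median), bar chart for frequencies.
# All data below is made up for illustration.
import matplotlib.pyplot as plt

heights_cm = [165, 170, 172, 168, 181, 176, 169, 174]          # continuous data
category_counts = {"Group A": 12, "Group B": 19, "Group C": 7}  # frequencies

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.boxplot(heights_cm)                 # median and spread at a glance
ax1.set_title("Height (cm)")

ax2.bar(list(category_counts.keys()),   # counts per category
        list(category_counts.values()))
ax2.set_title("Frequency by group")

fig.tight_layout()
plt.show()
```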

Lastly, even though data visualization is a great way to present data intuitively to the audience and other stakeholders in research, you should accompany the visualizations with the underlying numbers. Avoid leaving out the numbers for any analysis: numbers are the most accurate metrics, are easy to compare, and serve as a reference for future research. Including all the relevant numerical data makes the research highly comparable, which is essential.
