The challenge of categorizing research

Assigning publications to research fields can be a challenge. While algorithms can support the demarcation of fields, labeling fields properly requires knowing what holds them together. I investigated this problem and discovered interesting reasons why publications form a research field.

I recently visited CWTS to present and discuss my PhD project, which is about the algorithmic classification of research publications. My previous experience of working with this topic and the discussions during my week at CWTS led me to consider further the question of what holds a research field together.

In my work as a bibliometric analyst, researchers frequently ask our bibliometric group for analyses within a particular field of research. Users rarely consider how complex it is to identify the publications that belong to a research field. Sometimes the publications are relatively easy to identify, for example when the field corresponds well to a Medical Subject Heading. But even in those cases, the user may perceive the field differently from how it is expressed by the retrieved set of publications.

If we look into the fantastic tree of terms provided by the Medical Subject Headings (MeSH), we find terms expressing different properties of a publication. Some branches contain terms that correspond to what we would perceive as research disciplines, e.g. Philosophy or Psycholinguistics, while other branches include physical objects, such as chemicals or medication. There is also a branch for geographic locations, one for Diseases and another for Phenomena and processes. MeSH terms are manually assigned to publications and are created mainly to improve search systems rather than to categorize research publications. Publications can be categorized by combining MeSH terms. However, this requires a preconception of the research category, since not all combinations are meaningful.

The image shows a map of science based on 28 million publications and their citation relations. 234 disciplines are shown. The image is for illustrative purposes only. Certain data included herein are derived from the Web of Science ® prepared by Clarivate Analytics ®, Inc. (Clarivate®), Philadelphia, Pennsylvania, USA: © Copyright Clarivate Analytics Group ® 2019. All rights reserved.

When classifications are created from citation networks, classes are obtained from the formal communication that takes place in the form of citations. Such classifications therefore reflect the formal communication practices of the research community rather than the organizational division of research into disciplines. In this case, no preconceptions of research categories are needed. It is not an easy task to understand what holds together the different classes created by such a classification, not least because the publication set can be very large (currently we cluster about 28 million publications at Karolinska Institutet, where I work).
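To illustrate how classes can emerge from citation links alone, here is a minimal sketch in Python. It is not the method used for the 28 million publications mentioned above: it swaps in the simplest possible grouping rule (connected components of the citation graph) for a real community detection algorithm, and the publication identifiers and the function name `cluster_by_citations` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: (citing publication, cited publication)
citations = [
    ("p1", "p2"), ("p2", "p3"), ("p3", "p1"),
    ("p4", "p5"), ("p6", "p5"),
]


def cluster_by_citations(citations):
    """Group publications into connected components of the citation graph.

    Assumption: citation direction is ignored, so a link in either
    direction counts as communication between two publications.
    """
    neighbours = defaultdict(set)
    for citing, cited in citations:
        neighbours[citing].add(cited)
        neighbours[cited].add(citing)

    seen = set()
    classes = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from `start`.
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in neighbours[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        classes.append(sorted(component))
    return classes


classes = cluster_by_citations(citations)
print(classes)
```

No field names or categories go into the function; the two groups it returns follow from the links only. That is also why the resulting classes arrive without labels and have to be interpreted afterwards.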

So far, I can draw at least two conclusions from my experience. First, the kinds of properties that hold a research field together differ from field to field. Second, research fields are formed by a combination of properties. I will give three examples:

  1. Some years ago we studied the research field of nanocellulose materials at my former workplace, KTH Royal Institute of Technology. Nanocellulose can be used to create strong, thin materials from natural fibers or bacteria. We could identify three sub-fields within the field, each focusing on a different method for obtaining nanocellulose. In this case, the research field is defined by different methodologies.
  2. In a study of mining and minerals research, also at KTH Royal Institute of Technology, we noticed that fields were centered on combinations of geographic locations and topical properties. This is not very surprising, since geography is important for mineral extraction. Within the medical field, too, I have found research fields that focus on particular geographic areas, for example primary care in Mexico.
  3. A research field may also reflect the combination of diseases and treatments. A single treatment may be applied to several diseases. For example, hydroxyurea is a medication used to treat several different conditions, among them sickle cell disease and cervical cancer. Both applications can be identified as distinct research fields. Interestingly, a different kind of combination is also possible: another field we identified focuses on the causal relation between hydroxyurea and leg ulcers.

In my current work, I improve methods for labelling algorithmically obtained publication classes. This work raises the questions of what defines a research field and how a field can be described. The examples above show that a combination of several terms is sometimes necessary to describe a research field accurately. Further, different kinds of properties may hold a field together, and class labels must express this as well.
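The need for combined terms can be sketched with a small Python example. This is an illustrative assumption, not the labelling method developed in the PhD project: each class is labelled with every term that occurs in at least a given share of its publications. The class identifiers, the term sets, the function name `label_classes` and the `min_share` threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical toy data: class id -> publications, each described by a set of terms
classes = {
    "A": [
        {"hydroxyurea", "sickle cell disease"},
        {"hydroxyurea", "sickle cell disease", "children"},
        {"hydroxyurea", "sickle cell disease"},
    ],
    "B": [
        {"hydroxyurea", "leg ulcer"},
        {"hydroxyurea", "leg ulcer", "case report"},
        {"hydroxyurea", "leg ulcer"},
    ],
}


def label_classes(classes, min_share=0.8):
    """Label each class with all terms found in at least `min_share` of its publications."""
    labels = {}
    for class_id, publications in classes.items():
        # Count in how many publications of the class each term occurs.
        counts = Counter(term for terms in publications for term in terms)
        common = sorted(
            term for term, n in counts.items()
            if n / len(publications) >= min_share
        )
        labels[class_id] = " + ".join(common)
    return labels


labels = label_classes(classes)
for class_id, label in labels.items():
    print(class_id, "->", label)
```

In this toy data the single term "hydroxyurea" is common to both classes and therefore cannot tell them apart, whereas the combined labels can. The sketch does not record which kind of property each term expresses (treatment, disease, location, method), which a fuller approach would also need to capture.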

Of course, a perfect classification will never be obtained. On the contrary, I believe that the availability of different classifications and different methods to delineate research is useful when answering different questions about research activities. Algorithmic classification gives some information about what kinds of properties form a research field, at least if we have proper labels that make it possible to interpret the contents of classes. I hope that my ongoing PhD research will contribute further to solving this problem of labelling classes.

