PivotTable Based Impurity Analysis
Using PivotTable in Excel and Python to perform impurity analysis of chromatographic data.
Motivation
As a scientist specializing in process chromatography, I've analyzed numerous chromatograms containing proteins of interest, whether in intermediate stages of purification or in final stages like cation-exchnge and reverse-phase chromatography. In these later stages, it's crucial to identify impurities present in the solution, in addition to the protein or biosimilar of interest. By following a predefined method protocol and understanding the biosimilar's retention time, we can usually identify impurities effectively.
However, when dealing with a large number of samples, determining the percentage of impurities in a particular sample at a specific stage can be a time-consuming task. Additionally, if a scientist wants to compare one stage's sample to another, they often have to manually analyze the chromatographic peak data.
To simplify and streamline this process, I have developed a solution using a combination of Python programming and Pivot Table and Pivot Chart functionality available in Excel. This solution makes it easier to identify impurities and compare samples across different stages, and it also presents the data in a clear and professional manner.
The impurity analysis of biosimilar downstream purification involves several fundamental steps, which are as follows:
Collecting Relevant Data: The first step is to obtain the relevant data from the chromatography system and transfer it to an Excel file.
Cleaning the Data: Next, the data must be processed and cleaned to remove any anomalies, errors, or inconsistencies that may affect the results.
Determining Relative Retention Times: Once the data has been cleaned, a Python program can be used to calculate the relative retention times (RRT) for each of the impurities present.
Performing Impurity Analysis: Finally, a Python library or Excel Pivot Table can be employed to perform impurity analysis on the data. This analysis will provide information about the types and quantities of impurities present in the samples, enabling researchers to make informed decisions about downstream purification processes.
By following these steps, researchers can obtain accurate and reliable data about impurities present in biosimilar samples, which is essential for ensuring the safety and efficacy of the final product.
Collecting Relevant Data
The chromatography peak data can typically be obtained as an Excel or CSV file from the chromatography system, and it contains essential parameters such as the sample name, identifiable biosimilar or intermediate product at a specified retention time (referred to as the Biosimilar Content column), retention times for each component, and the area under the curve (AUC) or, preferably, the percentage area.
A sample chromatography impurity file has been provided to illustrate these parameters and their significance in impurity analysis. In the table below, we see analytical chromatography data of three samples whose biosimilar of interest is Liraglutide at a retention time of 25.4 min.
Cleaning the Data:
Cleaning the chromatography data is an essential step in the analysis process.
If there are washable components present in the sample, their retention times should be excluded from the analysis to prevent unwanted high peaks at the start and end of the chromatogram from dominating the analysis.
It is important to ensure that each sample has a unique name and is not duplicated elsewhere in the data. If identical sample names are necessary, it's recommended to include another column that differentiates them by batch name, stage, or some other parameter.
Once the data has been cleaned in Excel, it's necessary to convert the range to a table with the "table has headers" box checked by selecting the table and pressing Ctrl-t. This step ensures that the data is organized and easy to work with.
It is important to name the peak of interest for every sample.
With a cleaned Excel table, dataframes can be created in Python, making it easier to analyze and visualize the data.
Determining Relative Retention Times using Python
To determine the relative retention times (RRT) of impurities, a Python program can be used to divide each impurity's retention time by the retention time of the component of interest. This calculation provides a reliable measure of the impurity's RRT values relative to the protein or biosimilar of interest, ensuring accurate identification and analysis.
Import relevant Python libraries:
import pandas as pd
import math
import numpy as np
Read excel file and get relevant columns in a pandas dataframe.
df = pd.read_excel('Analytical Chromatogram Data.xlsx', sheet_name = 'NoRRT'
usecols=['SampleName', 'Biosimilar Content', 'Retention Time', '% Area'])
To calculate the RRT values for each sample, two lists are needed: the Biosimilar_Peak list and the Sample_List. The Biosimilar_Peak list contains the retention time data for each peak of interest in the list of samples, while the Sample_List lists all the samples based on their names in the SampleName column, ensuring that each name is unique.
To create the Biosimilar_Peak list, we filter out empty cells from the 'Biosimilar Content' column and retain the retention times only for the rows that have a name attached to them. For the Sample_List, we can use the built-in unique() function in the pandas Series and convert the resulting collection to a list.
Biosimilar_Peak = []
for index, row in df.iterrows():
if not row['Biosimilar Content'] is np.NaN:
Biosimilar_Peak.append(row['Retention Time'])
Sample_List = list(df['SampleName'].unique())
After creating the Biosimilar_Peak list and the Sample_List, we map each sample in the Sample_List to its corresponding retention time values in the Biosimilar_Peak list, resulting in a dictionary where the sample names are the keys and the retention times are the values. This mapping enables us to easily access and analyze the data for each sample in subsequent steps.
my_dict = dict(zip(Sample_List, Biosimilar_Peak))
To complete the process of calculating RRT values, we add a new column to the cleaned Excel table titled 'RRT'. We then use a lambda function to extract the corresponding retention time value from the dictionary we created earlier, based on the SampleName key, and apply it to the cells in the 'RRT' column. Once this is done, we save the dataframe to a new Excel file for further analysis.
# Create a new column by dividing column 'Retention Time' with the corresponding denominator value from the dictionary
df['RRT'] = df.apply(lambda x: round(x['Retention Time'] / my_dict[x['SampleName']],2), axis=1)
df.to_excel('Analytical Chromatogram Data.xlsx', sheet_name='RRT')
The updated table with RRT values is shown below:
Performing Impurity Analysis
After obtaining the RRT values, there are two ways to proceed. The first option is to use the cleaned Excel table to create a pivot table and a pivot chart, which can be customized according to the user's preferences. The second option is to use the pivottablejs library in Python to create a similar impurity profile. While Excel may be more user-friendly, using the pivot_ui function in pivottablejs can provide a hands-on experience for the user to interact with the data before submitting it. It is important to understand both options to determine which method best suits the user's needs.
Using Pivot Table and Pivot Chart in Excel
After obtaining the RRT values, we can create a Pivot Chart in Excel to visualize the impurity profile. To do this, select the table, go to Insert and choose Pivot Chart.
This will display a PivotTableFieldList and a placeholder for PivotChart.
Next, drag and drop the SampleName from the 'Choose fields to add to report:' section into Axis Fields (Categories) and RRT into Legend Fields (Series). Then, add % Area to the Values section and make sure that the average values are displayed by clicking the % Area dropdown, selecting Value Field Settings and choosing Average under the 'Summarize value fields by' section. It's also possible to customize the name here.
After setting up the chart, a column chart will be displayed that shows the percentage of impurities (RRTs other than 1.0) present among the protein of interest (RRT equal to 1.0). If the names of impurities are known, their RRT values can be replaced with the names in the Pivot Table. For instance, if we know that RRT 0.3 is Desamido, we can replace 0.3 in the column header with 'Desamido', and this will be reflected in the plot.
For a more detailed view of the relative composition in a 100% stack, a stacked column chart can be created instead. To do this, select the chart, go to the Design tab, choose Change Chart Type, and select 100% stacked column. This will display the samples along with the percentage of impurities in a stacked column.
Final output showing % impurity content in a 100% stacked column chart is shown below.
Using pivottablejs in Python
To generate the HTML and JavaScript code for the pivot table, we can make use of the pivot_ui function from the pivottablejs library and the HTML function from the IPython library.
First, we need to import the necessary libraries:
import pivottablejs
from IPython.display import HTML
Next, we can call the pivot_ui function on our cleaned and transformed DataFrame:
pivot_ui(df, outfile_path='pivot_table.html')
HTML('pivot_table.html')
To make changes inline, do not call HTML function. But it is preferable, as the HTML file can be forwarded as an attachment to be used collaboratively. pivot_ui creates a pivot table like interface as shown below.
Drag and drop SampleName button from the top stack to the left area. Drag and drop the RRT button from the top stack to the stack just beneath it. In the dropdown that displays 'Count', choose 'Average'. As soon as it is chosen, another dropdown appears that provides a choice among the table columns. Choose % Area from the dropdown. A similar pivot table as in Excel is created. The same procedure can be followed while accessing pivot_table.html file.
It can also create a 100% stacked chart, but the UI is not user-friendly, and I would suggest to use Excel file instead. The final stacked column chart is shown below.
Using Power BI
A similar impurity profiling can be done using Power BI. Once the excel file with RRT values is updated, load the data into Power BI.
If we have an Excel file without RRT values, we can modify the table inline to create RRT values using Python scripting in Power BI. Go to Get Data -> Python script and add the following python code:
import pandas as pd
import math
import numpy as np
df = pd.read_excel("< absolue path to the file'Analytical Chromatogram Data.xlsx' on your PC >", sheet_name = 'NoRRT',
usecols=['SampleName', 'Biosimilar Content', 'Retention Time', '% Area'])
Biosimilar_Peak = []
for index, row in df.iterrows():
if not row['Biosimilar Content'] is np.NaN:
Biosimilar_Peak.append(row['Retention Time'])
Sample_List = list(df['SampleName'].unique())
my_dict = dict(zip(Sample_List, Biosimilar_Peak))
# Create a new column by dividing column 'Retention Time' with the corresponding denominator value from the dictionary
df['RRT'] = df.apply(lambda x: round(x['Retention Time'] / my_dict[x['SampleName']],2), axis=1)
Here, < absolue path to the file'Analytical Chromatogram Data.xlsx' on your PC > needs to be replaced by the absolute path of the Excel file with impurity data on the user's PC. Updated pandas DataFrame is recognized by Power BI and can be loaded into the interface as a new table.
In case, the user has NoRRT file preloaded into Power BI and wants to create RRT values inline, they can use 'dataset' in Python script.
Load NoRRT Excel table as explained earlier.
After initial modifications, the NoRRT file looks something like this :
2. Go to Tranform > Run Python Script. Add the following Python code :
# 'dataset' holds the input data for this script
import pandas as pd
import math
import numpy as np
Biosimilar_Peak = []
for index, row in dataset.iterrows():
if not row['Biosimilar Content'] is np.NaN:
Biosimilar_Peak.append(row['Retention Time'])
Sample_List = list(dataset['SampleName'].unique())
my_dict = dict(zip(Sample_List, Biosimilar_Peak))
# Create a new column by dividing column 'Retention Time' with the corresponding denominator value from the dictionary
dataset['RRT'] = dataset.apply(lambda x: round(x['Retention Time'] / my_dict[x['SampleName']],2), axis=1)
dataset
Here 'df' is replaced by 'dataset' which provides definition to the current NoRRT table. Once the script is run, we get the following record table.
Click on the table link next to dataset to get the RRT modified table as shown below.
Rename 'NoRRT' to 'Impurity Table'. Thereafter, a 100% stacked bar plot can be created using 'SampleName' as y-axis, '% Area' as x-axis, 'RRT' as legend.
Note that based on the date, we may have to cluster RRT values to reduce the data. The method for that will be explained in another post. The stacked bar plot is demonstrated below: