Python Programming for Biology: Bioinformatics and Beyond



[Tim J. Stevens, Wayne Boucher] Python Programming

Reading array data

We will define the Microarray class a bit later, but first we will look at what kind of raw,
experimental data it will contain. This will be illustrated by writing functions that extract
data for a Microarray object from the contents of files, specifically text files and image
files. An alternative approach would be to make a blank Microarray object and then have
that load data into itself, but here we aim to first show the kind of actual underlying data
we are dealing with.

Importing text matrices

The load function will assume a simple file format where we have one line of text for each
array element. The columns of data in the file are two identifiers for the array coordinates
(these could be the row and column of the array) followed by the actual data signal value.
For example, this could be something like:

A B 1.230
A C 4.510
A D 0.075
B C 4.999
B D 0.258
C D 2.312

We will not assume that all elements of the array are represented and we will not
assume that the array row and column identifiers (i.e. A, B, C and D in the above
example) are either continuous or in any particular order.

The experimental microarray data will be represented as two-dimensional NumPy
arrays. Accordingly, we make various imports from the numpy library for the
mathematical array operations that will need to be done:

from numpy import array, dot, log, sqrt, uint8, zeros

The function to load data from a file and create a Microarray object takes a file name to
load from as the first argument, an identifying name for the array as the second argument
and an optional third argument to state what the default signal value is for an array
element that we have no data for. This last argument may or may not be used depending
on what data we read in, but we should at least be aware of incomplete or failed data
points. The default here is zero, rather than None, given that we are dealing with NumPy
arrays that don’t mix data types.

def loadDataMatrix(fileName, sampleName, default=0.0):

    fileObj = open(fileName, 'r')
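
The reasoning behind a numeric default can be illustrated with a minimal sketch. It is not part of the function being built, and the variable names are purely illustrative. It compares an array that contains None with arrays that contain only floating point values. NaN is shown only as another floating point placeholder a reader might consider; it is an assumption here, not the choice made in the text.

```python
from numpy import array, isnan

# Mixing a float with None makes NumPy fall back to the generic object type
mixed = array([1.23, None])

# A numeric placeholder keeps the array as floating point
withZero = array([1.23, 0.0])

# NaN is itself a float, so it is another possible "no data" marker
# (an assumption for illustration, not what the text uses)
withNan = array([1.23, float('nan')])

print(mixed.dtype)              # object
print(withZero.dtype)           # float64
print(isnan(withNan).tolist())  # [False, True]
```

With a default of zero a missing element cannot be told apart from a genuine zero signal, which is one reason to stay aware of incomplete data points.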

Empty sets are created which will contain the identifiers for the rows and columns in
the microarray data. In the end these could be filled with just numbers representing the
array coordinates, or they could be text labels. The only rule about these identifiers is that,
because they are in sets, they cannot be internally modifiable items like lists (they must be
hashable). The values in these sets will be keys to access the numeric data stored in the
dataDict dictionary.

    rows = set()
    cols = set()
    dataDict = {}
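
The hashability rule can be checked directly with a short sketch. The name identifiers is hypothetical and stands in for either of the two sets above.

```python
identifiers = set()

identifiers.add('A')     # text label: hashable
identifiers.add(3)       # integer coordinate: hashable
identifiers.add((2, 5))  # tuple of integers: hashable

try:
    identifiers.add([2, 5])  # a list can be modified in place, so it is unhashable
except TypeError as err:
    print('Rejected:', err)

print(len(identifiers))  # 3
```

The same restriction applies to dictionary keys, so any identifier accepted by the sets can also be used as a key in dataDict.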

Next we loop through all the lines in the open file object and split each line according to
internal whitespace to give values for the row, column and the numeric value. Naturally
this operation would be different if the format of the file was different.

    for line in fileObj:

        row, col, value = line.split()
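
As a side note on the splitting step, the sketch below shows how split() with no argument treats tabs and repeated spaces, and what happens with a blank or incomplete line. The guard on the number of fields is an optional addition assumed here for illustration; the function in the text unpacks the three values directly, which raises a ValueError for such lines.

```python
lines = ['A B 1.230\n', 'A\tC   4.510\n', '\n', 'B D\n']
parsed = []

for line in lines:
    fields = line.split()  # no argument: split on any run of whitespace

    if len(fields) != 3:
        # Blank or incomplete line; direct unpacking would raise ValueError
        print('Skipping line: %r' % line)
        continue

    row, col, value = fields
    parsed.append((row, col, value))

print(parsed)  # [('A', 'B', '1.230'), ('A', 'C', '4.510')]
```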

A check is made to ensure that each row identifier that comes from the file has an entry
in dataDict. If it does not then we make a new inner dictionary within the main one using
the row identifier as the key. Note that we cannot fill this dictionary, or any other data
collection, in advance because we do not know what rows or columns we have until the
file has been read.

        if row not in dataDict:
            dataDict[row] = {}

The actual signal value, which is a floating point number, is then added to the data by
using the row and column identifiers as keys for the main and inner dictionaries, although
we first check the dictionary to guard against using repeats for the same element. The row
and col that were just used are then added to the sets of rows and cols. Because they are set
data types it does not matter if we have seen them before, given that a set automatically
ignores repeated items. Note that the value is converted using float() because initially it is
just a text string loaded from a file, and not a Python number object.

        if col in dataDict[row]:
            # The identifiers are text strings, so they are formatted with %s
            print('Repeat entry found for element %s, %s' % (row, col))
            continue

        dataDict[row][col] = float(value)
        rows.add(row)
        cols.add(col)
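
To see what these steps produce, the loop can be run on a few lines of text held in a list instead of a file. This is a sketch only: sampleLines is a hypothetical stand-in for the file object, and its last entry deliberately repeats the element A, B.

```python
sampleLines = ['A B 1.230', 'A C 4.510', 'B C 4.999', 'A B 9.999']

rows = set()
cols = set()
dataDict = {}

for line in sampleLines:
    row, col, value = line.split()

    if row not in dataDict:
        dataDict[row] = {}

    if col in dataDict[row]:
        # The first value read for an element is kept; later repeats are skipped
        print('Repeat entry found for element %s, %s' % (row, col))
        continue

    dataDict[row][col] = float(value)
    rows.add(row)
    cols.add(col)

print(dataDict)      # {'A': {'B': 1.23, 'C': 4.51}, 'B': {'C': 4.999}}
print(sorted(rows))  # ['A', 'B']
print(sorted(cols))  # ['B', 'C']
```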

After all the lines have been processed we then convert the sets of row and column
identifiers to sorted, ordered lists. Only now that the total range of these identifiers has
been collected can they be used to create the axes of the NumPy 2D value array. The sizes
of this array will naturally be based on the number of row and column identifiers, which
are recorded as nRows and nCols.

    rows = sorted(rows)
    cols = sorted(cols)

    nRows = len(rows)
    nCols = len(cols)
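
One consequence of the sorting step is worth a brief illustration. Identifiers produced by line.split() are text strings, so sorted() orders them as text. The sketch below uses hypothetical identifier sets; sorting with key=int is an option assumed here for files whose identifiers are all whole numbers, and is not something the function in the text does.

```python
textLabels = {'D', 'A', 'C', 'B'}
print(sorted(textLabels))   # ['A', 'B', 'C', 'D']

# Numeric coordinates read from a file are still strings at this point
numericText = {'1', '2', '10'}
print(sorted(numericText))           # ['1', '10', '2']
print(sorted(numericText, key=int))  # ['1', '2', '10']
```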

The NumPy array to store the values is initialised as an array of zeros of the required
size, with the data type (the last argument) set to be floating point numbers:

    dataMatrix = zeros((nRows, nCols), float)

The rectangular dataMatrix array is then filled by extracting values from the dataDict,
although there is provision to replace missing values with the default value (hence we use
the .get() dictionary call). Note how we use enumerate() to extract index numbers (i and j)
as we loop through row and column identifiers. These indices are then used to fill the
correct position in the array. At the end the ordered lists of rows and cols will be used so
that the indices in the NumPy array can be used to look up the original data labels that
they refer to.

    for i, row in enumerate(rows):
        for j, col in enumerate(cols):
            value = dataDict[row].get(col, default)
            dataMatrix[i,j] = value

    fileObj.close()
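
The filling step can be tried in isolation with a small, hand-written dictionary. This is a sketch under assumed inputs: the contents of dataDict, rows and cols below are made up for illustration, with no value recorded for the element B, B.

```python
from numpy import zeros

dataDict = {'A': {'B': 1.23, 'C': 4.51}, 'B': {'C': 4.999}}
rows = ['A', 'B']
cols = ['B', 'C']
default = 0.0

dataMatrix = zeros((len(rows), len(cols)), float)

for i, row in enumerate(rows):
    for j, col in enumerate(cols):
        # .get() returns the default where no value was recorded
        dataMatrix[i,j] = dataDict[row].get(col, default)

print(dataMatrix.tolist())  # [[1.23, 4.51], [0.0, 4.999]]

# The ordered lists map array indices back to the original labels
i, j = 1, 1
print(rows[i], cols[j], dataMatrix[i,j])  # B C 4.999
```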

With the data collected we use it to construct a Microarray object, using the class
definition described below. This object is then passed back from the function.

    return Microarray(sampleName, dataMatrix, rows, cols)
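
For readers who want to run the steps end to end before the Microarray class has been defined, the following is a self-contained sketch. Several parts are assumptions and not the book's code: the namedtuple is only a hypothetical placeholder for the real class, the function is renamed loadDataMatrixSketch to keep the two apart, and a with statement is used so the file is closed even if a line fails to parse.

```python
import os
import tempfile
from collections import namedtuple
from numpy import zeros

# Hypothetical placeholder; the real Microarray class is defined later in the text
Microarray = namedtuple('Microarray', ['name', 'data', 'rows', 'cols'])

def loadDataMatrixSketch(fileName, sampleName, default=0.0):

    rows = set()
    cols = set()
    dataDict = {}

    with open(fileName, 'r') as fileObj:  # file is closed on leaving the block
        for line in fileObj:
            row, col, value = line.split()

            if row not in dataDict:
                dataDict[row] = {}

            if col in dataDict[row]:
                print('Repeat entry found for element %s, %s' % (row, col))
                continue

            dataDict[row][col] = float(value)
            rows.add(row)
            cols.add(col)

    rows = sorted(rows)
    cols = sorted(cols)
    dataMatrix = zeros((len(rows), len(cols)), float)

    for i, row in enumerate(rows):
        for j, col in enumerate(cols):
            dataMatrix[i,j] = dataDict[row].get(col, default)

    return Microarray(sampleName, dataMatrix, rows, cols)

# Write the example data from the text to a temporary file and load it
text = 'A B 1.230\nA C 4.510\nA D 0.075\nB C 4.999\nB D 0.258\nC D 2.312\n'
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(text)

microarray = loadDataMatrixSketch(tmp.name, 'example')
os.remove(tmp.name)

print(microarray.rows)  # ['A', 'B', 'C']
print(microarray.cols)  # ['B', 'C', 'D']
print(microarray.data.tolist())
# [[1.23, 4.51, 0.075], [0.0, 4.999, 0.258], [0.0, 0.0, 2.312]]
```

Note that the row and column lists differ here: D never appears as a row identifier and A never appears as a column identifier, so the matrix is 3 by 3 and the three absent elements take the default value.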



