Reading lines of data
The file object provides several functions that allow the underlying data in the file
to be read. The most commonly used are: read(), readline() and
readlines(). The read() function is used if you want to read an entire file in one go, into
one long string.
data = fileObj.read()
The read() function has an optional argument that specifies the required number of
bytes (addressable units of information) to load, but reading the entire file in one go is
more common for this function. Of course, if the file is huge this might not be a good idea,
because of memory limitations. By contrast, the readline() function reads only one line
from the file, and is often placed in a loop to process multiple lines, without having to load
everything at once.
line = fileObj.readline()
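The optional size argument to read() can be used to work through a large file in fixed-size pieces rather than in one go. The following is a minimal sketch, assuming Python 3 (where, for a file opened in text mode, the size counts characters rather than bytes). The function name countCharacters, the chunk size and the temporary demonstration file are assumptions made purely for illustration.

```python
import os
import tempfile

def countCharacters(path, chunkSize=1024):
    """Count the characters in a text file without loading it all at once."""
    total = 0
    fileObj = open(path, "r")
    chunk = fileObj.read(chunkSize)    # at most chunkSize characters
    while chunk:                       # an empty string signals end of file
        total += len(chunk)
        chunk = fileObj.read(chunkSize)
    fileObj.close()
    return total

# Hypothetical demonstration file: 1000 lines of five characters each
handle, demoPath = tempfile.mkstemp(suffix=".txt")
os.close(handle)
fileObj = open(demoPath, "w")
fileObj.write("ACGT\n" * 1000)
fileObj.close()

totalInChunks = countCharacters(demoPath, chunkSize=64)

# For comparison, read the whole file into one long string
fileObj = open(demoPath, "r")
data = fileObj.read()
fileObj.close()
os.remove(demoPath)

print(totalInChunks == len(data))
```

Both approaches see the same data; the chunked version simply never holds more than chunkSize characters in memory at a time.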
As far as readline() and readlines() are concerned, each line of the file is defined as a
text string ending in the newline character, or a string that stops at the end of the file
without a newline. In other words the newline character separates the lines. Note that the
newline character at the end of the line is not removed by these functions, i.e. it is
included as the last character of each returned string.
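To illustrate how the newline is retained, here is a small sketch. It assumes Python 3, and the file contents and variable names are invented for the example. The final line of the demonstration file deliberately lacks a newline.

```python
import os
import tempfile

# Hypothetical two-line file; the last line has no trailing newline
handle, demoPath = tempfile.mkstemp(suffix=".txt")
os.close(handle)
fileObj = open(demoPath, "w")
fileObj.write("first line\nsecond line")
fileObj.close()

fileObj = open(demoPath, "r")
first = fileObj.readline()     # 'first line\n' : the newline is kept
second = fileObj.readline()    # 'second line' : file ended without a newline
third = fileObj.readline()     # '' : nothing left to read
fileObj.close()
os.remove(demoPath)

# rstrip("\n") removes the trailing newline only if one is present
cleaned = [first.rstrip("\n"), second.rstrip("\n")]
print(repr(first), repr(second), repr(third), cleaned)
```

Stripping with rstrip("\n") is one common way to discard the newline before further processing; it leaves a final, unterminated line unchanged.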
For Unix-derived computer systems (e.g. Linux, OS X) the ‘\n’ newline character is the
normal convention. However, for Windows computers the normal convention is that a line
ends with the two characters ‘\r\n’, but that also works for the above example because the
last character is still ‘\n’. In some situations a file might have lines that only end with ‘\r’
and, given the way we have opened the file here, this would not automatically be
recognised by ‘readline’ as the end of the lines, even though it is intended to be.
Python has provided a convenient way to deal with the ‘\r’ versus ‘\n’ end-of-line issue.
(The newline ‘\n’ and carriage return ‘\r’ concepts are originally from the humble
mechanical typewriter.) The mode argument to the open() function can include the
character ‘U’, to specify universal line interpretation, so, for example:
fileObj = open(path, "rU")
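This means that when the file is read every occurrence of ‘\r\n’ is replaced with ‘\n’ and
every occurrence of just ‘\r’ is replaced with ‘\n’. This is the recommended method of
opening a text file when you are not sure of its line endings, unless of course the ‘\r’
characters (singly or in combination with ‘\n’) are required and mean something specific
for the file being considered. Note that in Python 3 this translation is the default
behaviour for files opened in text mode, and the ‘U’ mode character was removed in
Python 3.11, so there a plain "r" mode is used instead.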
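The effect of universal line interpretation can be seen with a small experiment. This sketch assumes Python 3, where the translation is controlled through the newline argument of open() rather than a ‘U’ in the mode string. The file contents and variable names are made up for the demonstration.

```python
import os
import tempfile

# Hypothetical file mixing Windows (\r\n), lone \r and Unix (\n) endings.
# It is written in binary mode so the bytes reach the disk unchanged.
handle, demoPath = tempfile.mkstemp(suffix=".txt")
os.close(handle)
fileObj = open(demoPath, "wb")
fileObj.write(b"one\r\ntwo\rthree\n")
fileObj.close()

# Default (newline=None): every kind of line ending is translated to '\n'
fileObj = open(demoPath, "r")
translated = fileObj.readlines()
fileObj.close()

# newline="\n": only '\n' ends a line and nothing is translated,
# so the lone '\r' is not recognised as the end of a line
fileObj = open(demoPath, "r", newline="\n")
untranslated = fileObj.readlines()
fileObj.close()
os.remove(demoPath)

print(translated)      # ['one\n', 'two\n', 'three\n']
print(untranslated)    # ['one\r\n', 'two\rthree\n']
```

The second result shows the problem described earlier: without universal interpretation, the lines "two" and "three" are returned as a single string.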
Every time you read part of a file, for example, using readline(), a record of the current
position in the file, the file pointer, advances. Hence, the next time you read some
more of the file, you read from where the previous read ended. When the file pointer
reaches the end of the file then the next readline() gives back an empty string; a
conveniently False value. Thus if you want to process one line at a time in a file you could
do the following, where the loop continues as long as the line is not empty:
fileObj = open(path, "rU")
line = fileObj.readline()
while line:
    # process line
    line = fileObj.readline()
fileObj.close()
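One point worth noting about this loop is that a blank line in the middle of a file is read as ‘\n’, which is not an empty string, so it does not stop the loop; only the end of the file does. A minimal sketch, assuming Python 3, with a demonstration file and names invented for illustration:

```python
import os
import tempfile

# Hypothetical file whose second line is blank
handle, demoPath = tempfile.mkstemp(suffix=".txt")
os.close(handle)
fileObj = open(demoPath, "w")
fileObj.write("alpha\n\nbeta\n")
fileObj.close()

linesSeen = []
fileObj = open(demoPath, "r")
line = fileObj.readline()
while line:                    # '\n' counts as true, '' as false
    linesSeen.append(line)
    line = fileObj.readline()
fileObj.close()
os.remove(demoPath)

print(linesSeen)               # ['alpha\n', '\n', 'beta\n']
```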
However, there is a more elegant alternative to using readline() repeatedly: from Python
2.2 onwards you don’t have to manage the lines yourself. Rather, the open file acts as an
iterable object, which leads to much simpler code: you can loop through the file as if
it were a list, with each line yielded in turn inside the loop:
fileObj = open(path, "rU")
for line in fileObj:
    pass # process line
fileObj.close()
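As a concrete illustration of looping over a file directly, the sketch below totals the numbers in a file that holds one value per line. It assumes Python 3; the function name sumValues, the file layout and the demonstration data are all hypothetical.

```python
import os
import tempfile

def sumValues(path):
    """Add up the numbers in a file containing one value per line."""
    total = 0.0
    fileObj = open(path, "r")
    for line in fileObj:       # each line arrives with its newline attached
        line = line.strip()    # remove the newline and surrounding spaces
        if line:               # skip blank lines
            total += float(line)
    fileObj.close()
    return total

# Hypothetical demonstration data, including one blank line
handle, demoPath = tempfile.mkstemp(suffix=".txt")
os.close(handle)
fileObj = open(demoPath, "w")
fileObj.write("1.5\n2.5\n\n4.0\n")
fileObj.close()

total = sumValues(demoPath)
os.remove(demoPath)
print(total)                   # 8.0
```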
The function readlines() reads all the lines in the file in one go, and returns a list of the
lines; a list of strings. Accordingly, an alternative way to process an entire file would be to
do:
fileObj = open(path, "rU")
lines = fileObj.readlines()
fileObj.close()
for line in lines:
    pass # process line
Again, as with the read() function, this is a reasonable approach if the file is not too
large. There is also an optional argument for readlines() giving a number of bytes to read,
whereupon that amount of data will be read, including any extra bit required to complete a
final, otherwise partial line. Another option, which is slicker, but arguably less clear, is to
open and read the file in a single statement:
for line in open(path, "rU"):
    pass # process line
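Here the file is closed implicitly, because it was not assigned to a variable: once the
loop has finished nothing refers to the file object any more, so it is discarded and closed
(immediately in the standard CPython interpreter, though other Python implementations
may delay this). This is a case where that is acceptable coding style. It is obvious, given
that no explicit variable name is stated, that the file is no longer used once the loop has
finished.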
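The optional size argument of readlines(), mentioned above, can be used to process a large file in batches of whole lines. A rough sketch follows, assuming Python 3 (where the argument is a hint, counted in characters for a file opened in text mode). The batch size and the demonstration file are arbitrary choices for illustration.

```python
import os
import tempfile

# Hypothetical file: ten lines of five characters each
handle, demoPath = tempfile.mkstemp(suffix=".txt")
os.close(handle)
fileObj = open(demoPath, "w")
fileObj.write("ACGT\n" * 10)
fileObj.close()

batchSizes = []
allComplete = True
fileObj = open(demoPath, "r")
batch = fileObj.readlines(12)      # stop after roughly 12 characters
while batch:                       # an empty list signals end of file
    # each batch holds only whole lines, never one cut part-way through
    for line in batch:
        if not line.endswith("\n"):
            allComplete = False
    batchSizes.append(len(batch))
    batch = fileObj.readlines(12)
fileObj.close()
os.remove(demoPath)

print(batchSizes, allComplete)
```

Because the final line of each batch is completed, a batch can hold slightly more data than the size requested.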
Another alternative to manually closing a file object is to use the with … as statement,
which was introduced in Python 2.5. For example, we could write:
with open(path, "rU") as fileObj:
    for line in fileObj:
        pass # process line
Here the with statement assigns the opened file object to the fileObj variable in a
special way. We won’t go into the precise details of what is happening, but the basic
principle is that a file object has built-in methods (__enter__ and __exit__) to deal
with its setup and release. In this case the result is that the file is closed at the end of the
with code block. Note that the with and as keywords are a general part of Python, and not
specifically related to files.
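The behaviour of the with statement can be checked, and the general mechanism illustrated, with a short sketch. It assumes Python 3. The Announcer class is a hypothetical example written only to show when __enter__ and __exit__ are called; it is not part of Python.

```python
import os
import tempfile

handle, demoPath = tempfile.mkstemp(suffix=".txt")
os.close(handle)

# The file is closed as soon as the with block ends
with open(demoPath, "w") as fileObj:
    fileObj.write("some data\n")
closedAfterBlock = fileObj.closed

# The file is also closed if an error occurs inside the block
try:
    with open(demoPath, "r") as fileObj:
        raise ValueError("problem while processing")
except ValueError:
    closedAfterError = fileObj.closed
os.remove(demoPath)

class Announcer:
    """Hypothetical class that records when it is set up and released."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self                # this is the object bound by 'as'

    def __exit__(self, excType, excValue, traceback):
        self.events.append("exit")
        return False               # do not suppress any exception

with Announcer() as announcer:
    announcer.events.append("inside")

print(closedAfterBlock, closedAfterError, announcer.events)
```

Any class that defines these two methods can be used with the with statement in the same way as a file.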