concept `delimiter` in category `dask`

appears as: delimiters, delimiter, delimiters

Data Science with Python and Dask

This is an excerpt from Manning's book Data Science with Python and Dask. Login to get full access to this book.

We’ll start with the simplest and most common format you’re likely to come across: delimited text files. Delimited text files come in many flavors, but all share the common concept of using special characters called delimiters that are used to divide data up into logical rows and columns.

Every delimited text file format has two types of delimiters: row delimiters and column delimiters. A row delimiter is a special character that indicates that you’ve reached the end of a row, and any additional data to the right of it should be considered part of the next row. The most common row delimiter is simply a newline character (\n) or a carriage return followed by a newline character (\r\n). Delimiting rows by line is a standard choice because it provides the additional benefit of breaking up the raw data visually and reflects the layout of a spreadsheet.

to see more go to 4 Loading data into DataFrames

Listing 9.5 A function to find the next occurrence of a delimiter in a file handle
from dask.delayed import delayed

def get_next_part(file, start_index, span_index=0, blocksize=1000):
    file.seek(start_index)    #1  
    buffer = file.read(blocksize + span_index).decode('cp1252')    #2  
    delimiter_position = buffer.find('\n\n')
    if delimiter_position == -1:    #3  
        return get_next_part(file, start_index, span_index + blocksize)
    else:
        file.seek(start_index)
        return start_index, delimiter_position

#1   Make sure the file handle is currently pointing at the correct starting position.
#2   Read the next chunk of bytes, decode into a string, and search for the delimiter.
#3   If the delimiter isn’t found (find returns −1), recursively call the get_next_part function to search the next chunk of bytes; otherwise, return the position of the found delimiter.
copy
Given a file handle and a starting position, this function will find the next occurrence of the delimiter. The recursive function call that occurs if the delimiter isn’t found in the current buffer adds the current buffer size to the span_index parameter. This is how the window continues to expand if the delimiter search fails. The first time the function is called, the span_index will be 0. With a default blocksize of 1000, this means the function will read the next 1000 bytes after the starting position (1000 blocksize + 0 span_index). If the find fails, the function is called again after incrementing the span_index by 1000. Then it will then try again by searching the next 2000 bytes after the starting position (1000 blocksize + 1000 span_index). If the find continues to fail, the search window will keep expanding by 1000 bytes until a delimiter is finally found or the end of the file is reached. A visual example of this process can be seen in figure 9.5.

Figure 9.5 A visual representation of the recursive delimiter search function

To find all instances of the delimiter in the file, we can call this function inside a loop that will iterate chunk by chunk until the end of the file is reached. To do this, we’ll use a while loop.
Listing 9.6 Finding all instances of the delimiter
with open('foods.txt', 'rb') as file_handle:
    size = file_handle.seek(0,2) – 1    #1  
    more_data = True    #2  
    output = []
    current_position = next_position = 0
    while more_data:
        if current_position >= size:    #3  
            more_data = False
        else:
            current_position, next_position = get_next_part(file_handle, current_position, 0)
            output.append((current_position, next_position))
            current_position = current_position + next_position + 2

#1   Get the total size of the file in bytes.
#2   Initialize a few variables to control the loop and store the output, starting at byte 0.
#3   If the end of the file has been reached, terminate the loop; otherwise, find the next instance of the delimiter starting at the current position; append the results to the output list, and update the current position to after the delimiter.
copy

to see more go to 9 Working with Bags and Arrays

After initializing a few variables, we enter the loop starting at byte 0. Every time a delimiter is found, the current position is advanced to just after the position of the delimiter. For instance, if the first delimiter starts at byte 627, the first review is made up of bytes 0 through 626. Bytes 0 and 626 would be appended to the output list, and the current position would be advanced to 628. We add two to the next_position variable because the delimiter is two bytes (each ‘\n’ is one byte). Therefore, since we don’t really care about keeping the delimiters as part of the final review objects, we’ll skip over them. The search for the next delimiter will pick up at byte 629, which should be the first character of the next review. This continues until the end of the file is reached. By then we have a list of tuples. The first element in each tuple represents the starting byte, and the second element in each tuple represents the number of bytes to read after the starting byte. The list of tuples looks like this:

to see more go to 9 Working with Bags and Arrays

concept delimiter in category dask

Data Science with Python and Dask

Listing 9.5 A function to find the next occurrence of a delimiter in a file handle

Figure 9.5 A visual representation of the recursive delimiter search function

Listing 9.6 Finding all instances of the delimiter

Unable to load book!

concept `delimiter` in category `dask`