Extract data from complex row-style text


In data processing, sometimes data comes from text files with complex formats. To extract useful data from them, you need to think about the following points:

1 Determine the structure of the data to be extracted, think of which fields to be extracted.

2 Determine whether the text line contains valid data.

3 Find the rule of each field extracted from the valid data row. The rule of different text data may be different, but there is always a rule to be parsed.


There is a customer quotation data item.txt in text format as shown in the figure below:

The lines before the horizontal line are the complex header, and each line after it is a quotation record with blank lines between the records. The figure shown is just a header and quotation record area. Such an area will repeatedly appear in the text file. The red boxes show the Unit Price and Exp. Date field columns, and there are Quotation Number, Customer Code, Customer Name field columns in the middle, and there are spaces between the data in each column.

Now you need to extract the quotation data in the text file and save it to the Excel file as shown in the figure below:

1. Observe and find the rule in the text. We found that this text has this rule:

(1) Lines with less than 136 characters have no valid information and can be skipped.

(2) The required data is located in each row from 59 to 136 columns.

(3) Each row of the effective information according to the space delimiters split, when the first split is a numeric value type, then this row is quotation records or may be skipped. The first split value is Unit Price column, and Second is Quotation Number column, the third is Customer Code column, the last is Contract Expiry Date column, the fourth to the second from the last connected Customer Name column by spaces.


A1 Create the target data set.

A2 Open the quotation text file item.txt and read in the content of the file. The option @n means that each line is read as a text.

A3 loops through each line of text and implements the rules found in the previous section.

B3C3 If the length of this line is less than 136 and skip this line.

B4 Extract the 59th to 136th columns of the data in this row.

C4 to B4 are extracted from the data split by whitespace, option t represents the removal of the blank ends after the split, and option p represents the split text is parsed into corresponding data types.

B5C5 If the first value split by C4 is not a numeric type and skip this line.

B6 Connect the fourth value of Split by C4 to the second value from the last with space.

B7 Insert the third value split by C4, the second value split by B6 and C4, the first value, and the last value into the new record of A1 in order.

A8 saves all extracted data to Excel file item.xlsx.



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store