Thursday, March 27, 2014

Strings in Jython - Part 1

Strings are one of the most common objects used when scripting. In this post I will introduce how to work with strings in Jython. Besides I will show two examples of parsing source files when our data contains string delimiters (' or ") to avoid errors when importing.

Strings in Jython
  • A string is a list of characters in order (letter, a number, backslash, quote...)
  • Jython recognizes as strings everything that is delimited by quotation marks (" " or ' ')
  • String/Array index starts at 0
  • We use [ ] to access characters in a string

  • Strings can be enclosed with single quotes 'mystring'
  • Strings can also be enclosed with double quotes "mystring"
  • Strings can also be enclosed in triple quotes when multi-lines are used
  • Substring is str[Start Position: Start Position + Length]
Raw Strings
Raw strings are used when we want to express strings that would be modified by escape sequence processing. They are very useful when using paths as literals. Raw strings are prefixed by r.
In the following example we want to print the new line character:
Operating with Strings
Although there are many functions and operators for Strings, the most common are:
  • Concatenation (+): combines 2+ strings into new string object
  • Repeat (*n): generates a string with n characters
  • split (): splits a string into a string array using the separator defined. If not separator is defined, whitespace is used
  • startswith/endswith: checks if a string starts or ends with a substring
  • strip: strip(), lstrip(), rstrip() are methods for removing any character from both ends of a string. If the character to be removed are not specified then whitespace will be removed. Same functionality as TRIM in VB.
  • join: takes a list of strings and joins them together with the calling string in between each element
  • Test: a string can be tested for truth value (T or F)
  • Replacing templates (%): the percent "%" character marks the start of the string to be included
    • %s used for strings
    • %d used for numbers
    • %f used for floating points

You can find more details about working with strings (functions, methods, operators...) in: http://www.jython.org/docs/library/string.html

Array Slice Notation
We can use slice notation to access characters or substrings of a string.
They key point is to understand that :end represents the first value that is not in the selected slice.
So the difference between end and start is the number of elements selected.
The other key point is that start or end may be a negative number, which means it counts from the end of the array instead of the beginning.
You will have to take care with situations where there are fewer items than you ask for. Jython will not return an error but an empty list. On the other hand, if you just want to access a position by using [ ] then you will get an error if the position does not exist:
 
Example 1: my account description has apostrophe character
Let's say we have the following source file where our Description has apostrophe (single quote):
If we try to import the file we can see that any of the three rows are imported:
The process log shows exceptions raised while executing the import format to AMOUNT field:
It seems that our apostrophe is causing the issue. FDMEE thinks that the single quote delimits the data row when it is passed to the import format process and therefore it does not find any amount in the line:

EastSales;LaborCost;akaFDMEE's example;[ICP None];Comma_PDAs;NoCustomer;NoChannel;[None];1000

We can easily test it by removing the single quote to see how data is imported. But we don't want this :-)
We would like to parse the file so the amount is processed and rows are successfully imported.
What are the key points for our solution?
- Replace apostrophe by empty string when processing amounts
- Need to parse any field after the description having the apostrophe
The following script replaces the apostrophe (single quote) by empty string and then return the 9th column which is actually the amount:
This script can be assigned to the amount field in the import format:
If we execute the import with the new import format we can verify that columns are imported only until Description:
So the real issue is when importing the Description into FDMEE. Any of the fields which number is greater than the description one are imported. FDMEE thinks the apostrophe is delimiting the row so it does not find fields greater than 3.
We have to create one script for each dimension...and set field number to 1 for all of them.
Don't want to do all of this? I would suggest moving the description or the field having the apostrophe to the last column. In this way we don't need any parse script (including amount) in order to get amount imported :-) 
All data will be imported successfully: 

Example 2: all my columns are enclosed in double quotes
There are many situations where files are exported from source systems with the following format:
As comma is the delimiter and amount is using Euro notation (comma as decimal delimiter), all fields are enclosed in double quotes " " in order to have the correct number of fields.
Let's try to import this file and see the results:
We can see that double quotes are not stripped from other columns but amount.
There are many solutions...one of them consists in stripping double quotes for each dimension. The following script will do that:
We can assign it to all columns so when we import data again all data looks good now:
Other solutions? give a try!

Enjoy!



 

 

9 comments:

  1. Hello,

    nice post, I was also struggling with apostroph.

    You are replacing apostrophe for column 9 but in the jython script you have put column 8 so we have to do -1 in the script for each column ?

    Thanks!

    ReplyDelete
  2. Hi,
    array positions starts from 0 so:
    - Column 1 --> Position 0
    - Column 2 --> Position 1
    ...
    - Column 9 --> Position 8

    ReplyDelete
  3. Hello,
    Did you put the script on the line "Amount" (in Import format) only because it is the last column of your CSV file ?
    Thanks in advance

    ReplyDelete
  4. Hi,

    amount is the first field to be processed by FDM/FDMEE.
    If you don't assign to amount and FDMEE finds amount is not numeric, not present...it will skip the line.

    ReplyDelete
  5. Hello again,

    When i write the script and assign it to the amount column, there is still a error during data loading :

    [SKIPED FROM SCRIPT ]

    Thanks in advance

    ReplyDelete
  6. Hi,
    sorry for the delay, You may be skipping rows from your script. I would recommend you review your script.

    Regards

    ReplyDelete
  7. Hi, I am trying to load values that have $ in amount field. i.e. "$100.00" or "($239,001.47)".
    I am able to read in the first example by returning strfield.strip("$"). However I am having trouble rading in the 2nd example. I tried rstip and lstrip to remove the ( and ) respectively, then stripping the $ to my variables etc. If that returned 239,001.47, idea was to use fload and multiple by -1. However I can't strip the brackets. Any idea?

    ReplyDelete
  8. Francisco,

    Thanks for the post, it was very helpful.

    ReplyDelete

Thanks for feedback!