Tuesday, November 25, 2014

Importing files with different File Charset Encoding - Part 1

There are many file character sets out there...
I'm sure you all are familiar with concepts like UTF-8.
Have you ever seen weird characters when importing data into FDMEE, or had a file that did not get imported even though there were apparently no errors?

Before starting I would like to share some points that helped me understand all of this better. I hope they help you too.

Types of strings in Python/Jython
There are 2 types of strings: 
  • byte strings
    • string elements are called bytes
    • there are only 256 possible byte values
  • unicode strings
    • string elements are called characters
    • Unicode has room for over 1,000,000 characters (code points)
    • very useful because they can store almost any character and are easy to manipulate
How can we convert a unicode string to a byte string?
The word we are looking for is "encode":
"An encoding is a representation of a unicode string"

Note that not every encoding supports every unicode character; many cover only a subset of Unicode. For this reason UTF-8 (Universal Character Set Transformation Format - 8-bit) is a good choice. It supports everything or, in other words, it defines a sequence of bytes for every unicode character. In fact, it's the default encoding for FDMEE.
So after understanding the above we could say that an encoding is essentially a mapping table which translates every unicode character it supports into one byte or a sequence of bytes. And the mapping table for UTF-8 is a complete one :-)
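A minimal sketch of the idea, using a sample string of my own choosing. It encodes the same unicode string with two encodings and shows that an encoding covering only a subset of Unicode (ASCII here) refuses characters it has no mapping for. It sticks to syntax that should also run on Jython 2.5, although it has not been tested inside FDMEE.

```python
# Sample text (an assumption for illustration): "España €"
text = u"Espa\u00f1a \u20ac"

utf8_bytes = text.encode("utf-8")       # UTF-8 has a byte sequence for every character
utf16_bytes = text.encode("utf-16-le")  # same text, completely different bytes

n_chars = len(text)         # characters in the unicode string
n_utf8 = len(utf8_bytes)    # bytes needed in UTF-8
n_utf16 = len(utf16_bytes)  # bytes needed in UTF-16 LE

# ASCII only maps a small subset of Unicode, so encoding fails here.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

The same 8 characters take 11 bytes in UTF-8 and 16 bytes in UTF-16 LE, which is one way to see that the bytes depend on the encoding and not only on the text.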

If we read a byte string (typically from external sources), we need to decode it in order to manipulate it as a unicode string.

Aligning concepts
Now that we have a clearer idea about these concepts we can clarify something. We typically say that text is either ASCII or UTF-8 or UTF-16 (I used to say it), as if bytes were text. However, text is only text. When we store text we should actually talk about encoding that text into a sequence of bytes. If we talk about images, there are many different ways to encode images into bytes. I'm sure you are familiar with JPG, BMP, etc. In the same way, there are many different ways to encode text into bytes: UTF-8, UTF-16, ASCII, etc.
Once we encode, bytes are just bytes. If we want the original text, we will have to decode.
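As a small illustration (the sample word is my own), the round trip below only gives back the original text when we decode with the same encoding that was used to encode. Decoding with a different one may not raise any error at all; it just produces the wrong characters.

```python
original = u"Espa\u00f1a"                 # "España"

stored = original.encode("utf-8")        # text -> bytes (what ends up in a file)
restored = stored.decode("utf-8")        # bytes -> text, using the same encoding

# Decoding the same bytes as Latin-1 does not fail, but the text is wrong:
# the two bytes of the UTF-8 "ñ" become two separate characters.
garbled = stored.decode("latin-1")
```

Here `restored` equals `original`, while `garbled` reads "EspaÃ±a". This is the typical kind of weird character we see when the encoding assumed by the reader does not match the one used by the writer.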

To summarize
  • Encode: Unicode > bytes 
  • Decode: bytes > Unicode
Please take into consideration that we are discussing this topic in the Jython 2.5.1 context. I say that because you may find differences if you have a look at Python 3.x, where str is always unicode and bytes is a separate type (I would recommend this article if you are interested in seeing the differences).
I don't want to confuse you (even myself) so we will skip this as FDMEE 11.1.2.3 uses Jython 2.5.1.

File Character Set Option in FDMEE
FDMEE knows source files can use different character encodings depending on many factors, like the source system generating the file, regions, people, etc. For this reason it provides the option File Character Set (FCS) at three different levels:
  • System Level (Profile File)
  • Application Level
  • User Level
Lower levels override higher levels. For example, we could have one generic FCS, another one for some target applications, and different ones for different users loading data to the same applications.
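The override rule can be pictured with a tiny sketch. This is only an illustration of the precedence described above (user over application over system); the function name and arguments are hypothetical and have nothing to do with how FDMEE stores or resolves its settings internally.

```python
def effective_charset(system, application=None, user=None):
    """Return the most specific File Character Set that is defined.

    Hypothetical helper: the most specific level (user) wins, then
    application, then the system-wide value.
    """
    for value in (user, application, system):
        if value:
            return value
    return None


generic = effective_charset("UTF-8")
per_app = effective_charset("UTF-8", application="UTF-16LE")
per_user = effective_charset("UTF-8", application="UTF-16LE", user="ISO-8859-1")
```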
I would say that this option includes values for all encodings.
Today I'm going to discuss how we could manage a different scenario in FDMEE: having the same user loading different files with different FCS to the same application.

Our Scenario
Our FDMEE application is designed so we have one location used to import files generated from their Legacy System PIOLIN. These files use UTF-8 encoding.
In addition to this, the user responsible for loading PIOLIN data is also responsible for loading HFM eliminations into their HP application. The HFM data is extracted using Extended Analytics (EA). The generated file uses UTF16-LE encoding and is compressed with GZIP.
Therefore we have:
  • PIOLIN files using UTF-8
  • HFM files using UTF16-LE
  • HFM files compressed with GZIP (filename.csv.gz)
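To make the third point concrete, here is a rough sketch of how a gzipped UTF-16 LE file could be turned back into text with the standard `gzip` module. The function name is my own, the whole file is read into memory for simplicity, and the check for a leading byte order mark is an assumption about how the extract is written.

```python
import gzip


def read_gzipped_text(path, encoding="utf-16-le"):
    """Decompress a .gz file and decode its bytes into a unicode string."""
    f = gzip.open(path, "rb")
    try:
        raw = f.read()          # bytes, still encoded
    finally:
        f.close()
    text = raw.decode(encoding)  # bytes -> unicode
    # Drop a leading byte order mark if the file has one (assumption).
    if text.startswith(u"\ufeff"):
        text = text[1:]
    return text
```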
Our First Solution: use different users with different encoding
Let's say that we configure FCS as UTF-8 at application level. 
It would be great if we could set up a different encoding for each source file, but unfortunately we cannot with the current release (11.1.2.3.520).
This is because the FCS option cannot currently be defined at Import Format level. If we could do that, our solution would be straightforward.
Let's think then. It seems that the only solution would be to create additional FDMEE users, because if we set the FCS value at user level we override both the System and Application options. So for each user I could define a different value for the FCS option and the import would work as expected.
I'm not going to ask if you like this solution because I don't :-). Just imagine an FDMEE user having to log off and log on to import different files. Not good.

A Better Solution: convert source file encoding to FDMEE's one
If we have our generic FCS set as UTF-8 in FDMEE, why not convert our source file encoding to UTF-8? This is a very common approach when integrating data from heterogeneous systems. It's quite normal that they use different FCS, so conversion and standardization are commonly needed.
Is that possible? Is it easy?
Nothing is straightforward but with a bit of creativity everything is possible.
We have already discussed how powerful Jython is for FDMEE scripting.
Using import scripts would be an option but I don't like the idea of having one import script for each import field. Performance of applying multiple import scripts may not be good.
However, we know that we have an Event Script called BefImport that is executed just before our source file is imported. So what about using this event script to convert our source file encoding before it is imported?
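As a rough idea of what such a conversion could look like, the sketch below re-encodes a file from UTF-16 LE to UTF-8 with the standard `codecs` module, reading in chunks so large files do not have to fit in memory. The function name, the default encodings and the byte order mark handling are my own assumptions. How the paths are obtained inside a BefImport script is not shown, and this is not necessarily the final solution.

```python
import codecs


def convert_encoding(src_path, dst_path,
                     src_encoding="utf-16-le", dst_encoding="utf-8"):
    """Re-encode a text file from src_encoding to dst_encoding."""
    src = codecs.open(src_path, "r", src_encoding)
    try:
        dst = codecs.open(dst_path, "w", dst_encoding)
        try:
            first = True
            while True:
                chunk = src.read(64 * 1024)   # decoded unicode text
                if not chunk:
                    break
                # Drop a leading byte order mark so it is not carried
                # over into the converted file (assumption).
                if first and chunk.startswith(u"\ufeff"):
                    chunk = chunk[1:]
                first = False
                dst.write(chunk)              # encoded again on write
        finally:
            dst.close()
    finally:
        src.close()
```

Writing to a separate target file and only then replacing the original is probably safer than converting in place, because a failed conversion would leave the source untouched.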

My Source File
I will forget about GZIP compression for the moment and just focus on my source file.
If we have an editor like Notepad++ (my best friend) we can easily see which encoding is used.
My file uses UCS-2 Little Endian, which is the older version of UTF16-LE (you can check it here)
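If no editor is at hand, a similar check can be approximated in a script by looking at the byte order mark (BOM) at the start of the file. This is only a sketch: the function name is hypothetical, it takes the first bytes of the file as input, and it returns nothing useful for files written without a BOM.

```python
import codecs

# Checked from longest to shortest, because the UTF-32 LE mark starts
# with the same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(head):
    """Return the encoding suggested by a leading BOM in `head`, or None."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None
```

The first four bytes are enough as input, for example the result of opening the file in binary mode and calling `read(4)`.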

My first import
Before building any solution we will see what happens with my import when the FCS option is set to UTF-8 and my source file uses a different encoding.
So after configuring our FDMEE artifacts (Location, Import Format, Data Load Rule, etc), we import our source file and we get a grey fish.
The end user would go to the Process Details page and see something that he might not understand.
It says data was imported successfully and there is a warning for the mapping step.
So if data was imported, why don't I see it? It's time to look through the FDMEE process log.
When we see all the error messages and weird characters in the log, we can suspect that the problem is probably related to file encoding. It seems that FDMEE was not able to read the file correctly.
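We cannot see exactly how FDMEE reads the file, but a small experiment outside FDMEE gives an idea of why a UTF-16 LE file looks like garbage to a UTF-8 reader. The sample row below is made up for illustration.

```python
import codecs

line = u"Entity;Account;100"                              # made-up sample row
raw = codecs.BOM_UTF16_LE + line.encode("utf-16-le")      # bytes as stored in the file

# Strict UTF-8 decoding rejects the data: the BOM byte 0xFF is never valid UTF-8.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient reader substitutes replacement characters (U+FFFD) for the
# invalid bytes and keeps the NUL byte that follows every ASCII character.
lenient = raw.decode("utf-8", "replace")
```

The lenient result starts with replacement characters and has a NUL between every visible character, so delimiters and numbers no longer line up the way a parser expects.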

A quick note on this: the Process Details page will not show that the mapping step failed, because no data was imported and therefore there was no data to be mapped. With time you will learn how to interpret scenarios like this. I know business users would appreciate more descriptive and intuitive messages in the Process Details page without having to open the log. I'm not inventing this, they just told me :-)

You may want to take some advice on this: if you have potential format issues, don't try to build any solution as a first step. Just use Notepad++ to convert the file to UTF-8, or change the FCS option accordingly. Once you confirm data is imported, you can start thinking about the solution.
For example, if we import the same file after updating the FCS option value to UTF16-LE, we can see our data is successfully imported.

We now know our issue is related to charset encoding so let's think about the solution...

2 comments:

  1. Did you ever write a second part to this ? :)

  2. so are you saying you need to convert Little Endian to UTF-8? Will FDMEE not read 11.1.2.4?

