Problem in Auto detecting Codepage - Programmer

  1. Problem in Auto detecting Codepage

    Hi,

    I am trying to auto-detect the code page of a text file that contains
    English as well as other-language characters.
    The file is saved as UTF-8.
    For this I am using IMultiLanguage2::DetectInputCodepage with
    MLDETECTCP_NONE.


    The problem is that for some files it detects the actual code page,
    whereas for others it simply returns the English code page.


    Here is the relevant piece of code (CoCreateInstance has already been
    called).


    if (S_OK == mycodePageRecognizer.GetIMultiLanguage2(&pMultiLanguage2))
    {
        // Hand the raw interface pointer over to the wrapper object.
        XInterface xMultiLanguage2;
        xMultiLanguage2.Set(pMultiLanguage2);
        pMultiLanguage2 = 0;

        // In: size of the source buffer in bytes. Out: bytes processed.
        INT pcSrcSize = myserialStream.GetNewFileSize();

        // In: capacity of the result array. Out: number of entries filled.
        DetectEncodingInfo myEncodings[1];
        INT cEncodings = sizeof(myEncodings) / sizeof(DetectEncodingInfo);

        HRESULT hr = xMultiLanguage2.GetPointer()->DetectInputCodepage(
            MLDETECTCP_NONE, 0, pSrcStr, &pcSrcSize, myEncodings, &cEncodings);

        if (SUCCEEDED(hr) && cEncodings > 0)
        {
            // Only the first (top-ranked) candidate is used.
            myulCodePage = myEncodings[0].nCodePage;
        }
    }


    As an example, with a text file containing English and Japanese
    characters, detection worked when the file had 199 characters but failed
    for 200+ characters.
    On increasing it to around 250 characters, it started working again
    (it returned the correct code page).
    I know there is probably no correlation with the size; I mention it only
    as an example.


    To repeat, the file is saved as UTF-8. Detection worked for any number of
    characters when the file was saved as UTF-16 BE or LE.


    Please help me find where exactly I am going wrong.


    Thanks and Regards,
    Shreshth Luthra
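
Since the files here are known to be saved as UTF-8, one check that does not depend on statistical detection is to test whether the bytes are structurally valid UTF-8 before (or instead of) asking for a guess. The sketch below is only an illustration under that assumption: `looksLikeUtf8` is a hypothetical helper name, not part of MLang, and it takes the raw file bytes as a `std::string`. Note that pure ASCII input also passes, because ASCII is a subset of UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not an MLang API): returns true if `bytes` is
// structurally valid UTF-8. Rejects stray continuation bytes, truncated
// sequences, overlong encodings, surrogates and values above U+10FFFF.
bool looksLikeUtf8(const std::string& bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) {          // single-byte (ASCII) character
            ++i;
            continue;
        }

        std::size_t extra = 0;      // continuation bytes expected
        unsigned long minValue = 0; // smallest code point for this length
        unsigned long cp = 0;       // decoded code point
        if ((lead & 0xE0) == 0xC0)      { extra = 1; minValue = 0x80;    cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; minValue = 0x800;   cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; minValue = 0x10000; cp = lead & 0x07; }
        else return false;          // continuation byte or invalid lead byte

        if (extra > n - 1 - i) return false;   // sequence runs past the end

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cont = static_cast<unsigned char>(bytes[i + k]);
            if ((cont & 0xC0) != 0x80) return false;  // not 10xxxxxx
            cp = (cp << 6) | (cont & 0x3F);
        }

        if (cp < minValue) return false;                  // overlong form
        if (cp > 0x10FFFF) return false;                  // beyond Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate range

        i += extra + 1;
    }
    return true;
}
```

If the check passes and the buffer contains at least one non-ASCII byte, treating the file as code page 65001 (UTF-8) is a reasonable choice; otherwise the DetectInputCodepage result can still be used. Text in a legacy multi-byte encoding such as Shift-JIS will usually fail this check, though that is a tendency rather than a guarantee.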


  2. Re: Problem in Auto detecting Codepage

    One more thing I tried was passing a larger array of DetectEncodingInfo
    structures. The correct code page then shows up in the 2nd or 3rd entry,
    but the last data member of that entry, nConfidence, is -1.

    Can anyone explain what I can deduce from this, and how to use it to get
    a better result?

    Regards,
    Shreshth
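
One possible way to use the larger result array, sketched under assumptions: if the caller already has a reason to expect a particular code page (here UTF-8, 65001), scan all returned candidates for it instead of taking only entry 0. The struct below is a stand-in that mirrors the fields of DetectEncodingInfo so the sketch compiles without `<mlang.h>`; `EncodingCandidate` and `pickCodePage` are hypothetical names. The sketch deliberately does not interpret nConfidence, since the meaning of -1 is not established in this thread.

```cpp
#include <cstddef>

// Stand-in for MLang's DetectEncodingInfo (same field names and order).
// In real code, use the SDK type from <mlang.h> instead.
struct EncodingCandidate {
    unsigned int nLangID;
    unsigned int nCodePage;
    int          nDocPercent;
    int          nConfidence;
};

const unsigned int kCodePageUtf8 = 65001;

// Hypothetical helper: returns `preferred` if it appears anywhere among the
// candidates; otherwise the first candidate's code page; otherwise
// `fallback` when no candidates were returned.
unsigned int pickCodePage(const EncodingCandidate* candidates, int count,
                          unsigned int preferred, unsigned int fallback)
{
    if (candidates == 0 || count <= 0) {
        return fallback;
    }
    for (int i = 0; i < count; ++i) {
        if (candidates[i].nCodePage == preferred) {
            return preferred;
        }
    }
    return candidates[0].nCodePage;
}
```

With the original code, this would mean declaring `myEncodings` with several elements and calling something like `pickCodePage(myEncodings, cEncodings, kCodePageUtf8, defaultCodePage)` after a successful call, where `defaultCodePage` is whatever the application already falls back to. This only makes sense when the preferred code page is known from some other source, such as the file's origin or a byte-level validity check.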
