Thread: offset-based hash table for ASCII data

  1. Re: offset-based hash table for ASCII data

    Rainer Weikusat wrote:
    > If it would, then this would be the type of system you already have,
    > because you already have one representation on the server, a transport
    > representation and a representation on the client. That you
    > (presently) construct identical client representations x times in
    > parallel and manage to overwhelm the 'client machine' with this
    > nonsense doesn't change the structure of the application, just its
    > "simple-mindedness".


    You may have me (the OP) mixed up with someone else. I write not to
    defend the current implementation but to improve it.

    > Parsing the XML-steganogram once and re-using it
    > x - 1 times would be by far the simplest solution from a coding
    > standpoint (as someone else already wrote).


    Yes, with the single exception of the alternative, which is to deliver
    it in pre-parsed form and thus remove one more intermediate representation.

    N < N + 1. QED.

    >> Also, again in response to you and a few others, network bandwidth is
    >> not an issue. I.e. moving the XML document, even if it's large (and it
    >> often is) is not a significant cost,

    >
    > ... because, as you need to understand, this is MY PRIVATE GIG-E LAN
    > which has been solely provided to transport MY XML (and it has excess
    > bandwidth for that) ...


    You are swinging (and missing) especially wildly today. The
    insignificance of transport is primarily due to the fact that it's a
    fairly rare event which occurs just once per client/server transaction,
    while the client fires off children, as noted, potentially thousands of
    times per transaction.

    The largest XML file I've measured to date is around 5 megabytes.
    Because XML is so wordy it compresses quite well, and thus the gzipped
    version which travels over the wire is around 150K. This may happen
    about once every half hour. I don't know what your network looks like
    but this is well within the noise factor for mine.

    More than that, the raison d'être of this application is to keep the
    clients from having to run unnecessary jobs. I.e. it's an optimizer. So
    whatever network traffic it generates will be paid back by orders of
    magnitude in other traffic which doesn't happen later.

    > If network usage is not of concern, why do you bother to compress the
    > XML? Not doing so would be another easy way to make the processing on
    > both client and server much simpler.


    Arg. I do it because it's trivially easy and the right thing to do. On
    the server side it's as easy as constructing a "new GZIPOutputStream"
    object. The client side uses libcurl which has built-in knowledge of
    compression so it's as easy as adding an Accept-Encoding header.

    It really couldn't be simpler than that. The XML parser doesn't have to
    worry about compression because it's uncompressed by then. See, it's
    compressed, sent over the wire, uncompressed, then parsed. All with two
    extra lines of code, one in each language.

    >> It's the multi-parsing that's a problem.

    >
    > Well, then don't.


    Arg. If you didn't spend so much energy looking for someone to bully,
    you might have more fun. I know we would.

    RM

  2. Re: offset-based hash table for ASCII data

    Rex Mottram writes:
    > Rainer Weikusat wrote:
    >> If it would, then this would be the type of system you already have,
    >> because you already have one representation on the server, a transport
    >> representation and a representation on the client. That you
    >> (presently) construct identical client representations x times in
    >> parallel and manage to overwhelm the 'client machine' with this
    >> nonsense doesn't change the structure of the application, just its
    >> "simple-mindedness".

    >
    > You may have me (the OP) mixed up with someone else.


    No, I was being sarcastic.

    >> Parsing the XML-steganogram once and re-using it
    >> x - 1 times would be by far the simplest solution from a coding
    >> standpoint (as someone else already wrote).

    >
    > Yes, with the single exception of the alternative, which is to deliver
    > it in pre-parsed form and thus remove one more intermediate
    > representation.


    Assuming a UNIX(*)-like 'application structure', the necessary changes
    would roughly amount to forking after the XML has been parsed instead
    of before it has been parsed.
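
    A sketch of the idea, with a stand-in buffer where the real client
    would build its parsed tree:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NCHILDREN 4

    int main(void)
    {
        /* the expensive parse happens exactly once, in the parent */
        char *parsed = strdup("pretend this is the parsed representation");

        for (int i = 0; i < NCHILDREN; i++) {
            if (fork() == 0) {
                /* child: inherits 'parsed' read-only via copy-on-write */
                printf("child %d sees: %s\n", i, parsed);
                _exit(0);
            }
        }
        while (wait(NULL) > 0)      /* reap all children */
            ;
        free(parsed);
        return 0;
    }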

    >>> Also, again in response to you and a few others, network bandwidth is
    >>> not an issue. I.e. moving the XML document, even if it's large (and it
    >>> often is) is not a significant cost,

    >> ... because, as you need to understand, this is MY PRIVATE GIG-E LAN
    >> which is been solely provided to transport MY XML (and it has excess
    >> bandwidth for that) ...

    >
    > You are swinging (and missing) especially wildly today.
    > The insignificance of transport is primarily due to the fact that it's a
    > fairly rare event which occurs just once per client/server
    > transaction,


    Less creative editing could greatly aid your understanding:

    ,----
    | It is not possible to assess the significance of the 'cost' of a
    | particular communication isolated. Networks are usually used by
    | multiple (and often, many) computers for multiple (often many)
    | applications in parallel.
    `----

    See. This time, I even explained the sarcasm. Note that this was a
    general remark regarding the (common) misconception of the 'isolated,
    insignificant event', which really isn't an 'isolated event' but a
    contributing factor.

    [...]

    >> If network usage is not of concern, why do you bother to compress the
    >> XML? Not doing so would be another easy way to make the processing on
    >> both client and server much simpler.

    >
    > Arg. I do it because it's trivially easy and the right thing to do.


    Something which is 'trivially easy' for you to code is not necessarily
    'trivially easy' for a computer to execute. In this particular
    situation, it is certain that the process is computationally expensive.
    You (again) stated that 'network usage' is of no concern to you, while
    you (again) stated that 'CPU usage on the client' is of concern to
    you. Yet you try to save 'network bandwidth' at the expense of
    'processing time' by using compression.

    This doesn't exactly make sense together.

    [...]

    >>> It's the multi-parsing that's a problem.

    >> Well, then don't.

    >
    > Arg. If you didn't spend so much energy looking for someone to bully,
    > you might have more fun. I know we would.


    That you know that some or all of you would have more fun if some or
    all of you spent less energy looking for someone to bully means
    nothing about me.

  3. Re: offset-based hash table for ASCII data

    Rainer Weikusat wrote:
    > Assuming a UNIX(*)-like 'application structure', the necessary changes
    > would roughly amount to forking after the XML has been parsed instead of
    > before it has been parsed.


    No, the "necessary changes" would include being able to remove an entire
    XML-processing state machine, many associated static variables, remove
    dependence on an external XML parsing library, not to mention (again)
    the extra representation. These are very significant differences. I
    truly think you've switched your POV here, either without realizing it
    or out of pure orneriness.

    > Less creative editing could help to aid your understanding greatly:
    >
    > ,----
    > | It is not possible to assess the significance of the 'cost' of a
    > | particular communication isolated. Networks are usually used by
    > | multiple (and often, many) computers for multiple (often many)
    > | applications in parallel.
    > `----
    >
    > See. This time, I even explained the sarcasm. Note that this was a
    > general remark regarding the (common) misconception of the 'isolated,
    > insignificant event', which really isn't an 'isolated event' but a
    > contributing factor.


    Speaking of creative editing, I gather you chose to remove the parts of
    my response which addressed this. Unlike you I know things about how my
    network is used and have done measurements. Sending the XML compressed
    is, empirically, an insignificant cost. Sending it uncompressed is
    probably insignificant too. The only reason I said anything about the
    transport cost and compressed vs not is that some respondents mistakenly
    thought that was the problem and I wanted to set them straight.

    There is no possible way that the cost of transporting this data packet
    is a significant fraction of the overall network burden, or network
    profile, of this application or the environment it operates in. You may
    repeat mantras like the above to your heart's content, but though they
    are certainly true in isolation they are not applicable here. This whole
    network bandwidth discussion is a red herring.

    >>> If network usage is not of concern, why do you bother to compress the
    >>> XML? Not doing so would be another easy way to make the processing on
    >>> both client and server much simpler.

    >> Arg. I do it because it's trivially easy and the right thing to do.

    >
    > Something which is 'trivially easy' for you to code is not necessarily
    > 'trivially easy' for a computer to execute. In this particular
    > situation, it is certain that the process is computationally expensive.
    > You (again) stated that 'network usage' is of no concern to you, while
    > you (again) stated that 'CPU usage on the client' is of concern to
    > you. Yet you try to save 'network bandwidth' at the expense of
    > 'processing time' by using compression.
    >
    > This doesn't exactly make sense together.


    For you to accuse me of creative editing and in the *same post* put a
    phrase in quotes, attributed directly to me, which I did not say is
    truly breathtaking. Search for the phrase 'CPU usage on the client' -
    its first appearance is when you "quoted" it.

    The point here is that all the one-time costs are swamped by the
    aggregate costs of parsing. This applies equally to the compression,
    transmission, and decompression of the XML, and many other constant
    costs. They are just not measurable. I don't know why you insist on
    talking about them.

    RM

  4. Re: offset-based hash table for ASCII data

    On Wed, 16 Apr 2008 11:40:00 -0400, Rex Mottram wrote:

    >The point here is that all the one-time costs are swamped by the
    >aggregate costs of parsing.

    Then change the parsing from an aggregate cost to a one-time cost.
    Parse once at the receiving end and let all the child processes have
    access to the read-only parsed version. It would not be too complex
    to set up version numbering or reference counting if required.

    I am not clear as to why you cannot do this; it seems such an obvious
    solution.

    rossum





  5. Re: offset-based hash table for ASCII data

    Mark Space wrote:
    > If you find anything interesting, let us know.


    I apologize for letting other parts of this thread devolve into a
    useless squabble. But for those of you with a substantive interest, let
    me report back that I'm getting moderately interested in/excited about
    the CDB ("constant database") family.

    Having realized that what I'm looking at can be thought of as a constant
    (meaning write once, read many) database, I did some searching and
    found CDB. The original CDB package is C code which is unchanged since
    2000 but still works fine. There's also a "TinyCDB" fork of the project
    which is newer and ported to Windows and still format-compatible with
    the original.

    Most interestingly, there are APIs for a number of languages including a
    jar file for Java (this version is called sg-cdb). So it looks like I
    can use the sg-cdb API to write a cdb-format file directly from Java and
    send it to the client(s), which can link with libcdb.a and navigate it
    directly.

    Whether I can encode my data into the key-value style of CDB (which is a
    dbm-like model) is an open question but that's my problem. For the sake
    of readers here and subsequent archive searches I thought I'd mention
    that the combination of sg-cdb on the server side and tinycdb on the
    client side may make a lot of sense for certain applications.
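
    For illustration, the client-side lookup would go roughly like this,
    judging from tinycdb's documented API (file and key names invented;
    link with -lcdb):

    #include <cdb.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        struct cdb cdb;
        const char *key = "some-key";            /* invented key */
        int fd = open("data.cdb", O_RDONLY);     /* invented filename */

        if (fd < 0 || cdb_init(&cdb, fd) != 0) { /* cdb_init mmaps the file */
            perror("open/cdb_init");
            return 1;
        }
        if (cdb_find(&cdb, key, strlen(key)) > 0) {
            unsigned len = cdb_datalen(&cdb);
            char *val = malloc(len + 1);

            cdb_read(&cdb, val, len, cdb_datapos(&cdb));
            val[len] = '\0';
            printf("%s => %s\n", key, val);
            free(val);
        }
        cdb_free(&cdb);
        close(fd);
        return 0;
    }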

    Licensing is also simple: TinyCDB is in the public domain and sg-cdb is
    BSD licensed.

    RM

  6. Re: offset-based hash table for ASCII data

    rossum wrote:
    > On Wed, 16 Apr 2008 11:40:00 -0400, Rex Mottram wrote:
    >
    >> The point here is that all the one-time costs are swamped by the
    >> aggregate costs of parsing.

    > Then change the parsing from an aggregate cost to a one-time cost.
    > Parse once at the receiving end and let all the child processes have
    > access to the read-only parsed version. It would not be too complex
    > to set up version numbering or reference counting if required.
    >
    > I am not clear as to why you cannot do this; it seems such an obvious
    > solution.


    I thought I had explained this in excruciating detail but will try once
    more. There are two problems with this approach, not necessarily
    unsolvable but still valid problems:

    1. It means doubled complexity, as the data has to be extracted from
    POJOs on the server side and converted to XML ("format A"), then parsed
    and placed into an alternate representation ("format B") on the client
    side. The preferable plan would be to have the server place the data
    directly into format B. I am not clear on why this is complex or
    controversial; it seems beyond argument to me. Particularly since I've
    already found one API for doing so (CDB), as mentioned elsewhere in the
    thread.

    2. In what format would that "read-only parsed version" be? It cannot be
    a traditional hash table or other ADT since they can't be shared between
    processes, as explained previously.

    The whole point is that I'm looking for a format in which the data can
    be shared. It looks like the CDB format might be a candidate, which
    allows me to turn the question around: whyever would I want to write the
    data to XML on the server and convert it to CDB on the client when
    there's an API for writing the CDB format directly on the server? I can
    see no argument at all for using two steps when one would suffice.

    RM

  7. Re: offset-based hash table for ASCII data

    rossum writes:
    > On Wed, 16 Apr 2008 11:40:00 -0400, Rex Mottram wrote:
    >
    >>The point here is that all the one-time costs are swamped by the
    >>aggregate costs of parsing.

    > Then change the parsing from an aggregate cost to a one-time cost.
    > Parse once at the receiving end and let all the child processes have
    > access to the read-only parsed version. It would not be too complex
    > to set up version numbering or reference counting if required.
    >
    > I am not clear as to why you cannot do this; it seems such an obvious
    > solution.


    The short answer is that the guy is a complete moron.

  8. Re: offset-based hash table for ASCII data

    Rex Mottram wrote:
    > Mark Space wrote:
    >> If you find anything interesting, let us know.


    > me report back that I'm getting moderately interested in/excited about
    > the CDB ("constant database") family.


    > Most interestingly, there are APIs for a number of languages including a
    > jar file for Java (this version is called sg-cdb). So it looks like I


    Thanks for the report. I learned some things in this thread, which is
    always valuable to me, and this project you found will be a cool tool I
    can keep in my back pocket in case I ever need such a thing.

    I hope this project goes well for you.

  9. Re: offset-based hash table for ASCII data

    On Wed, 16 Apr 2008 12:18:33 -0400, Rex Mottram wrote:


    > 2. In what format would that "read-only parsed version" be? It cannot be
    > a traditional hash table or other ADT since they can't be shared between
    > processes, as explained previously.


    No need to parse it.
    Just a native OS flatfile. Since your data + keys are ASCII,
    this flatfile is an ASCII file.

    > The whole point is that I'm looking for a format in which the data can
    > be shared. It looks like the CDB format might be a candidate, which
    > allows me to turn the question around: whyever would I want to write the
    > data to XML on the server and convert it to CDB on the client when
    > there's an API for writing the CDB format directly on the server? I can
    > see no argument at all for using two steps when one would suffice.


    You can store an array of {offset, hash,(key)len,next} in
    another flatfile.
    This second flatfile needs to be binary, but can be created on the target
    platform. The next field in the struct is not a pointer but an index into
    the array itself (for the overflow chain). (NB: use -1 or >=N as the
    sentinel value; zero won't work ;-)

    Using mmap on the target platform will make the lookup extremely fast
    (typically 2 page faults). Allocating two big arrays and reading them in
    at program startup will be faster (or equally fast), but the startup will
    be slower.

    Combining the two files is a possible extension, too.
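
    Something like this sketch (field widths and the hash function are my
    assumptions; the tiny main() fakes the two mmap'd files in memory):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* One index entry. 'next' is an array index, not a pointer, so the
       file can be mapped at any address; -1 ends a chain (0 is a valid
       slot). */
    struct slot {
        uint32_t offset;   /* byte offset of the key in the ASCII file */
        uint32_t hash;     /* full hash, to reject chain collisions early */
        uint32_t keylen;
        int32_t  next;     /* overflow chain link, -1 = end */
    };

    static uint32_t hash_key(const char *s, size_t n)  /* djb2, e.g. */
    {
        uint32_t h = 5381;
        while (n--)
            h = h * 33 + (unsigned char)*s++;
        return h;
    }

    /* 'text' and 'slots' would both come from mmap(); neither file
       contains a single pointer, only offsets and indexes */
    static int lookup(const char *text, const struct slot *slots,
                      const int32_t *buckets, uint32_t nbuckets,
                      const char *key)
    {
        uint32_t h = hash_key(key, strlen(key));

        for (int32_t i = buckets[h % nbuckets]; i != -1; i = slots[i].next)
            if (slots[i].hash == h && slots[i].keylen == strlen(key) &&
                memcmp(text + slots[i].offset, key, slots[i].keylen) == 0)
                return 1;
        return 0;
    }

    int main(void)
    {
        const char *text = "foo bar\n";       /* the "ASCII flatfile" */
        struct slot slots[2] = {
            { 0, 0, 3, -1 },                  /* "foo" */
            { 4, 0, 3, -1 },                  /* "bar" */
        };
        int32_t buckets[4] = { -1, -1, -1, -1 };

        for (int32_t i = 0; i < 2; i++) {     /* build the chains */
            slots[i].hash = hash_key(text + slots[i].offset, slots[i].keylen);
            slots[i].next = buckets[slots[i].hash % 4];
            buckets[slots[i].hash % 4] = i;
        }
        printf("foo: %d, baz: %d\n",
               lookup(text, slots, buckets, 4, "foo"),
               lookup(text, slots, buckets, 4, "baz"));
        return 0;
    }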



    HTH,
    AvK

  10. Re: offset-based hash table for ASCII data

    On Wed, 16 Apr 2008 16:41:23 +0200, Rainer Weikusat wrote:

    > Rex Mottram writes:
    >> Arg. I do it because it's trivially easy and the right thing to do.

    >
    > Something which is 'trivially easy' for you to code is not necessarily
    > 'trivially easy' for a computer to execute. In this particular
    > situation, it is certain that the process is computationally expensive.
    > You (again) stated that 'network usage' is of no concern to you, while
    > you (again) stated that 'CPU usage on the client' is of concern to you.
    > Yet you try to save 'network bandwidth' at the expense of 'processing
    > time' by using compression.
    >
    > This doesn't exactly make sense together.
    >


    Maybe it is faster for the client to decompress the little file than it
    would take to download the uncompressed version?

  11. Re: offset-based hash table for ASCII data

    jellybean stonerfish writes:
    > On Wed, 16 Apr 2008 16:41:23 +0200, Rainer Weikusat wrote:
    >> Rex Mottram writes:
    >>> Arg. I do it because it's trivially easy and the right thing to do.

    >>
    >> Something which is 'trivially easy' for you to code is not necessarily
    >> 'trivially easy' for a computer to execute. In this particular
    >> situation, it is certain that the process is computationally expensive.
    >> You (again) stated that 'network usage' is of no concern to you, while
    >> you (again) stated that 'CPU usage on the client' is of concern to you.
    >> Yet you try to save 'network bandwidth' at the expense of 'processing
    >> time' by using compression.
    >>
    >> This doesn't exactly make sense together.

    >
    > Maybe it is faster for the client to decompress the little file
    > than it would take to download the uncompressed version?


    It would be much faster to just not download any file ;-), i.e. the code
    on the client (presumably) doesn't do something because it is 'faster'
    than doing something else, but to accomplish some useful purpose. The
    same way 'the network' is shared among all applications trying to use
    it, the host CPU is shared among all applications running on a
    particular host. Download of a large file could go on as a background
    task, using very little host resources, more of which would then
    be available to 'other active tasks', while downloading a smaller file
    faster would have a more 'bursty' host resource usage pattern, and
    decompressing a small file to a large file is a resource-intense
    operation, although the absolute time needed by it may be short enough
    that measuring its influence could be difficult.

    This is all, of course, mostly 'thin air' without much more
    information about the actual application. But a statement like 'I do
    not care about transmitting much more data than I actually need,
    except that I compress these data to make it smaller, but anyway, my
    actual problem is that preprocessing of the downloaded data is too
    expensive for a particular situation' doesn't make much sense
    altogether. To me, it sounds a lot like 'I am trading off processor
    time for transmission time and now my programs run too slow'. Duh.

    If doing it all 'in the right way' doesn't work, then it isn't the
    right way.



  12. Re: offset-based hash table for ASCII data

    moi writes:
    > On Wed, 16 Apr 2008 12:18:33 -0400, Rex Mottram wrote:
    >> 2. In what format would that "read-only parsed version" be? It cannot be
    >> a traditional hash table or other ADT since they can't be shared between
    >> processes, as explained previously.

    >
    > No need to parse it. Just a native OS flatfile. Since your data +
    > keys are ASCII, this flatfile is an ASCII file.


    [...]

    > You can store an array of {offset, hash,(key)len,next} in
    > another flatfile. This second flatfile needs to be binary, but can
    > be created on the target platform.


    Which requires the data to be parsed on the target platform to create
    the index. It just (presumably) requires less complicated parsing
    than parsing the XML file. The most sensible way to do this would
    still be to preprocess the transport format once and share the result
    among all client instances. The OP has repeatedly gone ballistic at
    the mere suggestion that this could be a sensible strategy.

    Not to mention that a non-XML transport format would still need to be
    designed. But this would (likely) lead to reducing the amount of data
    transmitted, so it would, again, cause Mr Rex's mind to blow up in anger.



  13. Re: offset-based hash table for ASCII data

    On Thu, 17 Apr 2008 11:56:03 +0200, Rainer Weikusat wrote:

    > moi writes:
    >> On Wed, 16 Apr 2008 12:18:33 -0400, Rex Mottram wrote:
    >>> 2. In what format would that "read-only parsed version" be? It cannot
    >>> be a traditional hash table or other ADT since they can't be shared
    >>> between processes, as explained previously.

    >>
    >> No need to parse it. Just a native OS flatfile. Since your data + keys
    >> are ASCII, this flatfile is an ASCII file.

    >
    > [...]
    >
    >> You can store an array of {offset, hash,(key)len,next} in another
    >> flatfile. This second flatfile needs to be binary, but can be created
    >> on the target platform.

    >
    > Which requires the data to be parsed on the target platform to create
    > the index. It just (presumably) requires less complicated parsing than
    > parsing the XML file. The most sensible way to do this would still be to
    > preprocess the transport format once and share the result among all
    > client instances. The OP has repeatedly gone ballistic at the mere
    > suggestion that this could be a sensible strategy.


    I am afraid you are right. As an experiment, I used /usr/dict (~5MB, >500K
    words) and created the hashtable in approx 2 seconds walltime. Once the
    hashtable is present on disk, searching one key takes 2 ms or so (most of
    it program startup, of course). All using mmap.
    Creating a binary file for another architecture is not impossible to do
    (even dd has options to swap words). Most hardware is little-endian
    these days, and most people don't get beyond x86 anyway.

    Agreed, my parser is simple (just one sweep, looking for space and CR);
    an XML parse may need more than one pass, and a lot more state. But
    programs like these are governed by I/O cost, and one can easily perform
    10 scans over each buffer being read in (once it *is* read in ;-)

    >
    > Not to mention that a non-XML transport format would still need to be
    > designed. But this would (likely) lead to reducing the amount of data
    > transmitted, so it would, again, cause Mr Rex's mind to blow up in anger.


    XML is the problem. Sticking to it is more than a problem. It is a
    disease.

    AvK

  14. Re: offset-based hash table for ASCII data

    Rex Mottram writes:

    > The traditional issue with transportable data structures is that since
    > the client can't reliably control what address the data is mapped to,
    > all addressing must be made relative to the starting point. Does
    > anyone know of an implementation of such a format which can be
    > generated in Java and consumed in C code?


    IIUC, what you need is a hash table. Traditionally, a hash table is
    implemented as an array of pointers, where the pointers point to linked
    lists of objects that hash to the same value, and that value is the
    index into the array.

    So the fundamental problem is that pointers on the server contain
    addresses that are meaningless on the client.

    I've often seen linked lists implemented using array indices instead of
    pointers. For example:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct list {
        int next;
        const char *data;
    };

    static struct list pool[1024];
    static int freeidx;

    static int pool_alloc(void)
    {
        int tmp = freeidx;

        if (freeidx != -1)
            freeidx = pool[freeidx].next;

        return tmp;
    }

    static void pool_free(int idx)
    {
        pool[idx].next = freeidx;
        freeidx = idx;
    }

    static int hashtable[256];

    const char *strings[] = {
        "hello",
        "world",
        "this",
        "is",
        "a",
        "hashtable",
        "example"
    };

    static unsigned char hashfunc(const char *str)
    {
        unsigned int hash = 0;

        for (hash = 0; *str; ++str)
            hash += *str;

        return hash & 0xff;
    }

    static void hash_insert(const char *str)
    {
        int hidx = hashfunc(str);
        int pidx = pool_alloc();

        if (pidx == -1) {
            fputs("out of memory\n", stderr);
            exit(1);
        }

        pool[pidx].data = str;
        pool[pidx].next = hashtable[hidx];
        hashtable[hidx] = pidx;
    }

    static int hash_find(const char *str)
    {
        int hidx = hashfunc(str);
        int iter;

        for (iter = hashtable[hidx]; iter != -1; iter = pool[iter].next)
            if (strcmp(str, pool[iter].data) == 0)
                break;

        return iter != -1;
    }

    int main(int argc, char *argv[])
    {
        int i;

        /* initialize the pool: chain every slot into the free list */
        for (i = 0; i < 1024; i++)
            pool[i].next = i + 1;
        pool[i - 1].next = -1;

        /* initialize the hash table: -1 marks an empty bucket */
        memset(hashtable, -1, sizeof(hashtable));

        for (i = 0; i < (int)(sizeof(strings) / sizeof(strings[0])); i++)
            hash_insert(strings[i]);

        for (i = 1; i < argc; i++)
            printf("%s%s found in table\n", argv[i],
                   (hash_find(argv[i]) ? "" : " not"));

        return 0;
    }


    --
    Charles M. "Chip" Coldwell
    "Turn on, log in, tune out"
    GPG Key ID: 852E052F
    GPG Key Fingerprint: 77E5 2B51 4907 F08A 7E92 DE80 AFA9 9A8F 852E 052F

  15. Re: offset-based hash table for ASCII data

    On 16.04.2008 18:01, Rex Mottram wrote:
    > Mark Space wrote:
    >> If you find anything interesting, let us know.

    >
    > I apologize for letting other parts of this thread devolve into a
    > useless squabble. But for those of you with a substantive interest, let
    > me report back that I'm getting moderately interested in/excited about
    > the CDB ("constant database") family.
    >
    > Having realized that what I'm looking at can be thought of as a constant
    > (meaning write once, read many) database, I did some searching and
    > found CDB. The original CDB package is C code which is unchanged since
    > 2000 but still works fine. There's also a "TinyCDB" fork of the project
    > which is newer and ported to Windows and still format-compatible with
    > the original.
    >
    > Most interestingly, there are APIs for a number of languages including a
    > jar file for Java (this version is called sg-cdb). So it looks like I
    > can use the sg-cdb API to write a cdb-format file directly from Java and
    > send it to the client(s), which can link with libcdb.a and navigate it
    > directly.
    >
    > Whether I can encode my data into the key-value style of CDB (which is a
    > dbm-like model) is an open question but that's my problem. For the sake
    > of readers here and subsequent archive searches I thought I'd mention
    > that the combination of sg-cdb on the server side and tinycdb on the
    > client side may make a lot of sense for certain applications.
    >
    > Licensing is also simple: TinyCDB is in the public domain and sg-cdb is
    > BSD licensed.


    I tried to find a post of yours in the thread where you explain what
    kind of data structure you need but could not find one. The term
    "offset based hash table for ASCII data" is far too generic for me to
    get a concrete idea. Can you elaborate a bit or show a sample XML if it
    is not too large?

    Kind regards

    robert

  16. Re: offset-based hash table for ASCII data

    Charles Coldwell wrote:
    > Rex Mottram writes:
    >
    >> The traditional issue with transportable data structures is that since
    >> the client can't reliably control what address the data is mapped to,
    >> all addressing must be made relative to the starting point. Does
    >> anyone know of an implementation of such a format which can be
    >> generated in Java and consumed in C code?

    >
    > IIUC, what you need is a hash table. Traditionally, a hash table is
    > implemented as an array of pointers, where the pointers point to linked
    > lists of objects that hash to the same value, and that value is the
    > index into the array.
    >
    > So the fundamental problem is that pointers on the server contain
    > addresses that are meaningless on the client.


    Almost, but not quite. While it's true that pointers on the server don't
    map to pointers on the client, the problem is worse than that: you
    cannot share pointers between processes *even on the same system*.

    In order to share a hash table it would be necessary to mmap() the file
    containing it (or use a shared or malloc-ed memory region but the issues
    would be exactly the same). Unfortunately you cannot reliably ask for
    any of these addresses to be mapped at a preset place, which makes
    perfect sense when you think about it - what if something else is
    already mapped at that location? Thus the documentation sets for mmap
    and the equivalent Win32 APIs have clear warnings to the effect that
    "you may request mapping at a particular virtual address but you cannot
    count on that request being honored".

    So the problem is not just between server and client but between all
    processes whether on server or client (and in my case there are multiple
    client-side processes sharing the data). This is where the "ship the
    data to the client in XML, then parse it into a hash table and share it
    around" proposal made by many people falls down.

    Anyway, a good solution has been found which I will describe in another
    sub-thread.

    RM

  17. Re: offset-based hash table for ASCII data

    On Sun, 20 Apr 2008 13:19:26 -0400, Rex Mottram wrote:

    > Charles Coldwell wrote:


    > In order to share a hash table it would be necessary to mmap() the file
    > containing it (or use a shared or malloc-ed memory region but the issues
    > would be exactly the same). Unfortunately you cannot reliably ask for
    > any of these addresses to be mapped at a preset place, which makes


    No, you don't *need* to mmap the array to a fixed address.
    Just make sure all its members are offsets/indexes, not pointers.
    Period.
    [ see my previous posts (and others) in comp.programming ]

    > perfect sense when you think about it - what if something else is
    > already mapped at that location? Thus the documentation sets for mmmap
    > and the equivalent Win32 APIs have clear warnings to the effect that
    > "you may request mapping at a particular virtual address but you cannot
    > count on that request being honored".
    >
    > So the problem is not just between server and client but between all
    > processes whether on server or client (and in my case there are multiple
    > client-side processes sharing the data). This is where the "ship the
    > data to the client in XML, then parse it into a hash table and share it
    > around" proposal made by many people falls down.


    The problem is that you failed to describe your problem.

    > Anyway, a good solution has been found which I will describe in another

    The solution is trivial. Presenting the problem was much harder.

    AvK

  18. Re: offset-based hash table for ASCII data

    Robert Klemme wrote:
    > I tried to find a post of yours in the thread where you explain what
    > kind of data structure you need but could not find one. The term
    > "offset based hash table for ASCII data" is far too generic for me to
    > get a concrete idea. Can you elaborate a bit or show a sample XML if it
    > is not too large?


    I don't know how many other people will be interested in reading to this
    depth but for the record, here's a (long) synopsis. There are 3
    elemental concepts: Activities, States, and Transactions. Activities are
    operations which read some States and modify others. Every time an
    Activity takes place, all affected states are logged (in a server-side
    database) as a Transaction.

    Now, these Activities can be pretty slow/expensive to run. Therefore we
    look for opportunities to skip them. Specifically, if a previous
    Transaction was run on the identical set of input States, we can skip
    the new Activity and instead log a pointer back to the previous
    Transaction. This is a _really_ big win when possible.

    Since XML must be parsed sequentially and feeling that a DOM model would
    be too expensive, we went to some lengths to define a schema which could
    accommodate the above logic in a single (SAX) pass. The result was
    something like the following 3 stanzas (naturally this will have lots of
    angle brackets and I can't promise it will render correctly in HTML
    readers):

    [The sample XML stanzas - Transactions, Activities, and States - did
    not survive the HTML rendering.]

    This leaves out many details but the basic structure is fairly clear.
    The client code parses through the Transaction section, storing a
    reference to each transaction in a local hash table known as the TX
    table. Then it walks through the set of activities, doing a
    (necessarily) linear search for the current Activity. Once that is
    matched it sets a "found" flag and chews through the rest of the
    Activity stanza until reaching the States stanza.

    Next, each State is compared vs current reality. Whenever a State
    mismatch is encountered, the Transactions associated with the mismatched
    State are deleted from the TX table. If we reach the end of the States
    and the TX table is not empty, what remains are matches. If at any
    intermediate point we find that the TX table has become empty, we can
    abort the parse and just go run the Activity.

    What makes this preferable to a DOM design is the opportunity to abort
    the parse as soon as potential matches are exhausted; DOM would have
    required a full parse followed by some memory accesses.

    This is the current model and it works quite stably, but unfortunately
    not fast enough. XML parsing proves to be the dominant time cost. Since
    the Transaction stanza is small relative to the other two, which can
    grow quite large, it seems quite likely that it's the linear search
    aspect of the latter two stanzas that's the problem.

    Thus the key requirement for the "XML replacement technology" is that it
    be random access. A hash table is the obvious candidate, and that's
    exactly what CDB turned out to be, but a tree-based structure such as a
    red-black tree might have been fine too. Anyway, CDB turned out to be
    everything I was looking for.
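
    In toy form (invented data, and the real TX table is a hash table
    rather than an array), the filtering pass looks like this:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define NTX 3
    #define NSTATES 2

    struct tx { const char *inputs[NSTATES]; };   /* precondition states */

    int main(void)
    {
        /* previously logged transactions: the TX table */
        struct tx txs[NTX] = {
            { { "A=1", "B=2" } },
            { { "A=1", "B=3" } },
            { { "A=2", "B=2" } },
        };
        const char *current[NSTATES] = { "A=1", "B=2" };  /* reality */
        bool alive[NTX] = { true, true, true };
        int survivors = NTX;

        for (int s = 0; s < NSTATES && survivors > 0; s++)  /* early abort */
            for (int t = 0; t < NTX; t++)
                if (alive[t] && strcmp(txs[t].inputs[s], current[s]) != 0) {
                    alive[t] = false;                       /* evict */
                    survivors--;
                }

        if (survivors > 0)
            puts("match: skip the Activity, point at a prior Transaction");
        else
            puts("no match: run the Activity");
        return 0;
    }
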
  19. Re: offset-based hash table for ASCII data

    On 20.04.2008 20:36, Rex Mottram wrote:
    > Robert Klemme wrote:
    >> I tried to find a post of yours in the thread where you explain what
    >> kind of data structure you need but could not find one. The term
    >> "offset based hash table for ASCII data" is far too generic for me to
    >> get a concrete idea. Can you elaborate a bit or show a sample XML if
    >> it is not too large?

    >
    > I don't know how many other people will be interested in reading to this
    > depth but for the record, here's a (long) synopsis.


    Thanks for elaborating!

    > There are 3
    > elemental concepts: Activities, States, and Transactions. Activities are
    > operations which read some States and modify others. Every time an
    > Activity takes place, all affected states are logged (in a server-side
    > database) as a Transaction.


    With affected states do you mean prior or post activity (or both)? Is
    the transaction list you transfer to clients always the complete history?

    > Now, these Activities can be pretty slow/expensive to run. Therefore we
    > look for opportunities to skip them. Specifically, if a previous
    > Transaction was run on the identical set of input States, we can skip
    > the new Activity and instead log a pointer back to the previous
    > Transaction. This is a _really_ big win when possible.


    So you need a quick match based on (activity, (input state1, ...input
    stateN)), do you?

    > Since XML must be parsed sequentially and feeling that a DOM model would
    > be too expensive, we went to some lengths to define a schema which could
    > accommodate the above logic in a single (SAX) pass. The result was
    > something like the following 3 stanzas (naturally this will have lots of
    > angle brackets and I can't promise it will render correctly in HTML
    > readers):
    > [Transactions stanza lost in HTML rendering]


    You mention that transactions involve state but I do not see it here.

    > [Activities stanza lost in HTML rendering]

    > [States stanza lost in HTML rendering]
    > This leaves out many details but the basic structure is fairly clear.
    > The client code parses through the Transaction section, storing a
    > reference to each transaction in a local hash table known as the TX
    > table. Then it walks through the set of activities, doing a
    > (necessarily) linear search for the current Activity.


    How do you determine from the structure above which is the current one?
    There must be something (I am) missing.

    > Once that is
    > matched it sets a "found" flag and chews through the rest of the
    > Activity stanza until reaching the States stanza.
    >
    > Next, each State is compared vs current reality.


    So you have an additional input, current state(s)?

    > Whenever a State
    > mismatch is encountered, the Transactions associated with the mismatched
    > State are deleted from the TX table. If we reach the end of the States
    > and the TX table is not empty, what remains are matches. If at any
    > intermediate point we find that the TX table has become empty, we can
    > abort the parse and just go run the Activity.


    In other words: if after looking at all states the TX table is
    non-empty, you do not need to run the activity (and save a lot of time).

    > What makes this preferable to a DOM design is the opportunity to abort
    > the parse as soon as potential matches are exhausted; DOM would have
    > required a full parse followed by some memory accesses.


    I find DOM rarely the best choice. Even for other types of XML
    processing (that do not involve early exit) SAX is usually a much better
    (faster and more memory-efficient) choice.

    > This is the current model and it works quite stably, but unfortunately
    > not fast enough. XML parsing proves to be the dominant time cost. Since
    > the Transaction stanza is small relative to the other two, which can
    > grow quite large, it seems quite likely that it's the linear search
    > aspect of the latter two stanzas that's the problem.
    >
    > Thus the key requirement for the "XML replacement technology" is that it
    > be random access. A hash table is the obvious candidate, and that's
    > exactly what CDB turned out to be, but a tree-based structure such as a
    > red-black tree might have been fine too. Anyway, CDB turned out to be
    > everything I was looking for.


    And apparently you solved the encoding issue. That's good to hear.

    (reading CDB)

    My recommendations would have gone into a similar direction: define a
    binary format that satisfies your lookup requirements and create
    read/write code in Java and read code in C. If files are small enough
    you can memory map them; otherwise you have to use indexing techniques
    as CDB does. But since you found your solution we don't need to waste
    more brain cycles.

    Thanks again. And please let us know the outcome (aka speed
    improvement) or any issues you find along the way. That way we can all
    learn a bit. :-)

    Kind regards

    robert

  20. Re: offset-based hash table for ASCII data

    Robert Klemme wrote:
    > With affected states do you mean prior or post activity (or both)? Is
    > the transaction list you transfer to clients always the complete history?


    We _store_ both precondition states and result states of each
    transaction. But of course when looking for a prior match we need only
    compare preconditions.

    > So you need a quick match based on (activity, (input state1, ...input
    > stateN)), do you?


    Precisely.
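
    One plausible way to build such a key, sketched with invented names
    (not necessarily what I'll end up doing): sort the input states into
    a canonical order, then join activity and states with a separator
    that cannot occur in the data:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static int cmpstr(const void *a, const void *b)
    {
        return strcmp(*(const char *const *)a, *(const char *const *)b);
    }

    /* caller frees the result; '\x1f' (ASCII unit separator) keeps
       fields apart, assuming it never appears in the data */
    static char *make_key(const char *activity, const char **states,
                          size_t n)
    {
        size_t i, len = strlen(activity) + 1;
        char *key;

        qsort(states, n, sizeof *states, cmpstr);
        for (i = 0; i < n; i++)
            len += strlen(states[i]) + 1;

        key = malloc(len);
        strcpy(key, activity);
        for (i = 0; i < n; i++) {
            strcat(key, "\x1f");
            strcat(key, states[i]);
        }
        return key;
    }

    int main(void)
    {
        const char *states[] = { "B=2", "A=1" };
        char *key = make_key("build-foo", states, 2);

        puts(key);   /* "build-foo<US>A=1<US>B=2", stable across orderings */
        free(key);
        return 0;
    }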

    > You mention that transactions involve state but I do not see it here.


    I did say I was leaving out many details ... yes, in the XML there are
    attributes mapping each transaction to a list of states. In fact many
    attributes (represented by "...") were left out.

    >> This leaves out many details but the basic structure is fairly clear.
    >> The client code parses through the Transaction section, storing a
    >> reference to each transaction in a local hash table known as the TX
    >> table. Then it walks through the set of activities, doing a
    >> (necessarily) linear search for the current Activity.

    >
    > How do you determine from the structure above which is the current one?
    > There must be something (I am) missing.


    Yes, I left that out as an implementation detail. The client determines
    its current activity and then chews through the Activities stanza
    looking for a match.

    > So you have an additional input, current state(s)?


    Both current States and current Activity are discoverable by the client
    on its own, so whether that's considered an input or not is debatable
    (but not worth debating).

    > In other words: if after looking at all states the TX table is non
    > empty, you do not need to run the activity (and save a lot of time).


    Yes.

    > Thanks again. And please let us know the outcome (aka speed
    > improvement) or any issues you find along the way. That way we can all
    > learn a bit. :-)


    Will do.

    RM
