offset-based hash table for ASCII data - Unix



Thread: offset-based hash table for ASCII data

  1. Re: offset-based hash table for ASCII data

    Rex Mottram wrote:

    >
    > I don't know how many other people will be interested in reading to this
    > depth but for the record, here's a (long) synopsis. There are 3


    I'm still reading this thread too. I hope I don't need anything besides
    an XML parser for exchanging data, but if I do I'm glad to know there
    are other solutions out there.

    Laziness is a virtue. Never code your own solution when you can
    download one instead.

  2. Re: offset-based hash table for ASCII data

    Rex Mottram wrote:
    >> Thanks again. And please let us know the outcome (aka speed
    >> improvement) or any issues you find along the way. That way we can
    >> all learn a bit. :-)

    >
    > Will do.


    As promised, the final results:

    A. I've had no problems with CDB in either of the versions I'm using; it
    works quite well, is clearly documented, etc. It took a few minutes to
    port TinyCDB to Windows but no big deal, and I've offered the diffs to
    the author.

    B. In the particular test case I was profiling, a typical run of the XML
    version averaged around 19-20 minutes. The CDB version is around *2*
    minutes. I consider this a rousing success.

    I was a little surprised, FWIW, to find that the CDB file was typically
    no smaller than the XML file. Given how "wordy" XML is I'd have expected
    almost any other format to be smaller, but CDB came in around the same
    size. Of course these details will vary considerably with many factors
    so this is just one data point.

    RM

  3. Re: offset-based hash table for ASCII data

    > From: Rex Mottram
    > I'm looking for an offset-based data structure to hold character data.


    I'm not sure what you mean by "offset-based data structure".
    Do you mean that there are links (pointers) from one place to
    another, such as a link from a *place* within one block of data to
    the *front* of another block of data? For example:
    Block1: imhihmgihiothmctio[ptr]pfoevpogsip
                                |
            /-------------------/
            V
    Block2: erjevpigihgipgipughipru
    And instead of [ptr] showing the absolute address of Block2, it
    shows the difference between the address of Block2 and the address
    of the pointer itself?

    I definitely don't know what you mean by "character data" in this
    context. Except for pointers linking a place within one block to
    the start of another block (or whatever else you might have in mind
    that I can't guess), is the rest of the content of each block a
    sequence of characters? Are these characters fixed-size, such as
    US-ASCII or Latin-1 or UCS-2, or variable size such as UTF-8?

    > Background: I'm working with an app whose server is written in
    > Java while the client is in C. The server needs to package up a
    > bunch of data and ship it to the client where it will be traversed,
    > read-only, many many times.


    Java is certainly powerful enough to do this: build a data
    structure with genuine pointers between the various pieces;
    compile each atomic block of data (not including any links) into
    a pure array; create an abstract linear layout containing a
    handle for each atomic block and a token wherever a link to
    another block is needed; measure the sizes of all the atomic
    blocks and use that information to compute where each block and
    each link slot would start if laid out per that linear model,
    hence the offset for each link; and finally allocate one large
    block and copy all the atomic blocks and link offsets into it
    per the linear model. It's then trivial to write that array out
    to a file or communication channel verbatim.

    Can you please give me a small example of a structure that you
    might need to process in this way: perhaps three to five atomic
    blocks of data, no more than ten or so bytes of data within each
    atomic block, with two to four (N-1) links connecting them into
    a structure, or more than N-1 links if you allow some blocks to
    have more than one outgoing link? We could use that example to
    illustrate the linear-model algorithm, and eventually as test
    data for any code that would be written.

    > There's actually a working implementation which sends the data in
    > an XML format, but it's too slow and profiling reveals that it
    > spends almost all its time in XML parsing, so I need to replace the
    > XML with something more natural to the purpose.


    When you say "IT" is too slow, and when you say "IT" spends almost
    all its time in XML parsing, does "IT" mean the C code to receive
    the XML data and convert to internal data structure?

    > The client starts thousands of child processes which each start
    > by parsing the XML document in order to place its data in hash
    > tables, after which it is accessed via the hash table only.


    Are all the child processes on one machine, or distributed over
    many hosts all over the Internet?

    I can see how parsing XML more than once for copies of exactly the
    same data structure would be a wasteful practice.

    > So the XML is very much an intermediate format, and the obvious
    > win would be to have the Java server package the data in a form
    > which is directly usable by the clients.


    You need to tell us what forms of data *are* usable by the clients.
    For example, if a C program were given the total size of all the
    atomic blocks of data plus all the machine-absolute pointers needed
    for all the links from a place within a non-atomic block to the
    start of another block, would the C program be able to malloc a
    single block of that size, then sub-allocate to build all the
    blocks and links between them? Then would the rest of the program
    be able to run using that in-RAM layout of data? If so, it seems to
    me the following protocol would suffice:
    - Transmit an integer telling the total size of the data block needed.
    - Transmit a block of bytes of that size, containing all the atomic
    blocks of data in correct relative position, with zero bytes for
    all the places where pointers are needed.
    - Transmit an integer telling the total number of links.
    - Transmit a block consisting of that many (place,target) pairs,
    each relative to the start of the overall block.
    The C client then:
    - reads the first integer and mallocs a block of that size;
    - array-reads that many bytes into the block;
    - reads the second integer, computes the size of an array for
    that many pairs, and allocates and block-reads it;
    - steps through that second block, fetching each
    (relPlace,relTarget) pair, adding the address of the start of
    the first block to each value to yield (absPlace,absTarget),
    and executing *absPlace = absTarget;
    - finally frees the second block.
    So that's 6 lines of code to do all the input, a program loop of
    3 lines of code (or one line with a nested expression) to
    convert and patch all the links, and one line of code to free
    that second array.

    > Presumably the data would arrive from the server and be stored in
    > a file: each client would then map the file into its address space


    You seem to be implying that all the clients run on just one
    computer (possibly multiple CPUs sharing RAM), so that it's
    possible to map the one file into the address space of *all* the
    client processes?

    > and treat it as a hash table (or similar data structure) right
    > away.



    Why a hash table? Why do you need a hash table?
    You need to explain the abstract model of your data structure
    better, so that we can decide which is most appropriate for your
    needs: a simple linked structure (as I assumed above), a hash
    table, multiple independent linked structures with a hash-table
    entry for each, or multiple linked structures that share some of
    their structure, with a hash table that points into various
    blocks within the overall structure.

    > The traditional issue with transportable data structures is that
    > since the client can't reliably control what address the data is
    > mapped to, all addressing must be made relative to the starting
    > point.


    Agreed, but as soon as you have a hash table you can't even say
    what the relative locations are between various parts of the
    overall data corpus. If a hash table is used *only* for
    key-to-location lookup by the user interface, and all the corpus of
    data is pure linked-structure, then all you need is one large block
    of binary data (per the model I described earlier) plus a list of
    (key,location) pairs which can then be installed in the hash table.
    relLocation values would be transmitted, and the client would
    convert each to absLocation.

    I think you definitely need to create a *complete* toy example of
    what kind of data corpus structure you wish to transmit from server
    to client and build into disk file and thence into RAM on client,
    so that we can see what you are really wanting.

    > Does anyone know of an implementation of such a format which can
    > be generated in Java and consumed in C code?


    Before that can be answered, we need a better idea of precisely
    what kind of data corpus structure you wish to transmit from
    server to client. It would do no good for me to spend an hour
    writing the server code and five minutes writing the client
    code, for the model I described above, if that wouldn't serve
    your needs.

  4. Re: offset-based hash table for ASCII data

    On 25.04.2008 18:05, Rex Mottram wrote:
    > Rex Mottram wrote:
    >>> Thanks again. And please let us know the outcome (aka speed
    >>> improvement) or any issues you find along the way. That way we can
    >>> all learn a bit. :-)

    >>
    >> Will do.

    >
    > As promised, the final results:


    Thank you for the summary!

    > A. I've had no problems with CDB in either of the versions I'm using; it
    > works quite well, is clearly documented, etc. It took a few minutes to
    > port TinyCDB to Windows but no big deal, and I've offered the diffs to
    > the author.


    Thank you, that way the community can benefit as well.

    > B. In the particular test case I was profiling, a typical run of the XML
    > version averaged around 19-20 minutes. The CDB version is around *2*
    > minutes. I consider this a rousing success.


    Indeed!

    > I was a little surprised, FWIW, to find that the CDB file was typically
    > no smaller than the XML file. Given how "wordy" XML is I'd have expected
    > almost any other format to be smaller, but CDB came in around the same
    > size. Of course these details will vary considerably with many factors
    > so this is just one data point.


    IIRC CDB uses hashing. In-memory hash tables are usually
    allocated larger than the number of entries they hold, so that
    might be the case here as well. I am sure a closer look at the
    documentation of the file format would reveal the reason, but
    since you said that bandwidth was not an issue, I guess it's not
    worth bothering. Still a good hint for others.

    Kind regards

    robert
