inserting/deleting into/from the middle of large files? - Linux

  1. inserting/deleting into/from the middle of large files?

    'lo there, =)

    having a DVB-S receiver running Linux (PPC) I recently found
    myself wondering how to delete data from the middle of a
    large file (stripping a recording of ads, for example).
    Currently, the common way of doing this seems to be by
    copying the file (leaving a part behind) and then deleting
    the original. Of course, on a large file (say 12 GB) this
    can take an eternity; also you'll run into trouble if the
    filesystem is nearly full...
    Is it me, or doesn't that make any sense?
    With a block-oriented filesystem, operations like this
    should only take an instant.

    So basically I'm looking for functions to:
    - insert a chunk into a file
    - delete a chunk from a file
    - move a chunk from one file into another

    All of the above would be very useful when dealing with
    large data, such as DVB-recordings (Linux being the nr.1 OS
    on those receivers, naturally:-).

    Since I couldn't find any system calls providing such
    functionality, I am now asking You Gurus whether I was just
    too stupid to find them, or if there should indeed be a
    standard (POSIX?) for providing such functionality. (One
    call could be to determine if the filesystem supports those
    operations fast - it could return a version for instance, 0
    meaning that the operations, although provided, will be
    slow.)


    Thank you for any help! :-)
    LC (myLC@gmx.de)


  2. Re: inserting/deleting into/from the middle of large files?

    > how to delete data from the middle of a large file?

    Overwrite the data in its current place, turning the bytes
    into a "comment" or "skip this record."

    > Currently, the common way of doing this seems to be by
    > copying the file (leaving a part behind) and then deleting
    > the original.


    Yes.

    > So basically I'm looking for functions to:
    > - insert a chunk into a file
    > - delete a chunk from a file
    > - move a chunk from one file into another


    Most existing file systems for Linux do not have any such functionality.
    Common methods of accomplishing the task are:
    1) Do it the obvious and slow way.
    2) Generate smaller pieces in the first place; 'cat' them
    together for ordinary usage.
    3) Run a process which "serves" and "delivers" the data on demand.
    Use an index, btree, etc. to skip the "deleted" parts.
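
    Option 3 can be sketched roughly as follows. This is only an
    illustration, not how any particular receiver does it: the names
    build_index and to_physical are made up for this example, and the
    "deleted" list of byte ranges is assumed to be kept somewhere next
    to the recording. The index maps an offset in the file as the viewer
    sees it to the real offset on disk, so nothing is ever copied.

```python
import bisect


def build_index(size, deleted):
    """Map the kept parts of a `size`-byte file to logical offsets.

    `deleted` is a list of (start, end) physical byte ranges, end exclusive.
    Returns three parallel lists: logical start, physical start, length.
    """
    logical, physical, lengths = [], [], []
    pos = 0  # physical position reached so far
    log = 0  # logical position, counting only kept bytes
    # The (size, size) sentinel closes the last kept segment.
    for start, end in sorted(deleted) + [(size, size)]:
        start = min(start, size)
        if start > pos:
            logical.append(log)
            physical.append(pos)
            lengths.append(start - pos)
            log += start - pos
        pos = max(pos, min(end, size))
    return logical, physical, lengths


def to_physical(index, logical_offset):
    """Translate an offset in the 'edited' view to an offset in the real file."""
    logical, physical, lengths = index
    i = bisect.bisect_right(logical, logical_offset) - 1
    if i < 0 or logical_offset >= logical[i] + lengths[i]:
        raise ValueError("offset is past the end of the kept data")
    return physical[i] + (logical_offset - logical[i])
```

    A serving process would call to_physical() before each seek and
    read at most up to the end of the current kept segment.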

    --

  3. Re: inserting/deleting into/from the middle of large files?

    In article <1183636719.088746.206070@w5g2000hsg.googlegroups.com>,
    LC wrote:

    > 'lo there, =)
    >
    > having a DVB-S receiver running Linux (PPC) I recently found
    > myself wondering how to delete data from the middle of a
    > large file (stripping a recording of ads, for example).
    > Currently, the common way of doing this seems to be by
    > copying the file (leaving a part behind) and then deleting
    > the original. Of course, on a large file (say 12 GB) this
    > can take an eternity; also you'll run into trouble if the
    > filesystem is nearly full...
    > Is it me, or doesn't that make any sense?
    > With a block-oriented filesystem, operations like this
    > should only take an instant.
    >
    > So basically I'm looking for functions to:
    > - insert a chunk into a file
    > - delete a chunk from a file
    > - move a chunk from one file into another
    >
    > All of the above would be very useful when dealing with
    > large data, such as DVB-recordings (Linux being the nr.1 OS
    > on those receivers, naturally:-).


    I think you could do it by mmap()ping the file and then using memmove()
    to shift the part of the file after the chunk being inserted or deleted.
    However, with a 12 GB file this would only work in a 64-bit OS.
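
    A minimal sketch of that idea, using Python's mmap module in place
    of the C calls (mmap.move() is a memmove()-style copy in CPython).
    The function name delete_range_mmap is made up for this example,
    and it only covers deletion. It still copies the whole tail of the
    file; it merely avoids needing a second copy on disk.

```python
import mmap
import os


def delete_range_mmap(path, offset, length):
    """Delete `length` bytes at `offset` by shifting the tail down in a mapping."""
    size = os.path.getsize(path)
    if offset < 0 or length < 0 or offset + length > size:
        raise ValueError("range lies outside the file")
    tail = size - (offset + length)  # bytes that follow the deleted chunk
    with open(path, "r+b") as f:
        if tail and length:
            with mmap.mmap(f.fileno(), 0) as mm:
                # Source and destination overlap; move() copes with that.
                mm.move(offset, offset + length, tail)
                mm.flush()
        # The mapping is closed by now; cut off the stale bytes at the end.
        f.truncate(size - length)
```

    On a 32-bit system the mapping of a 12 GB file would fail, which is
    the limitation mentioned above.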

    --
    Barry Margolin, barmar@alum.mit.edu
    Arlington, MA
    *** PLEASE post questions in newsgroups, not directly to me ***
    *** PLEASE don't copy me on replies, I'll read them in the group ***

  4. Re: inserting/deleting into/from the middle of large files?

    LC writes:

    > 'lo there, =)
    >
    > having a DVB-S receiver running Linux (PPC) I recently found
    > myself wondering how to delete data from the middle of a
    > large file (stripping a recording of ads, for example).
    > Currently, the common way of doing this seems to be by
    > copying the file (leaving a part behind) and then deleting
    > the original. Of course, on a large file (say 12 GB) this
    > can take an eternity; also you'll run into trouble if the
    > filesystem is nearly full...
    > Is it me, or doesn't that make any sense?
    > With a block-oriented filesystem, operations like this
    > should only take an instant.
    >
    > So basically I'm looking for functions to:
    > - insert a chunk into a file
    > - delete a chunk from a file
    > - move a chunk from one file into another
    >
    > All of the above would be very useful when dealing with
    > large data, such as DVB-recordings (Linux being the nr.1 OS
    > on those receivers, naturally:-).
    >
    > Since I couldn't find any system calls providing such
    > functionality, I am now asking You Gurus whether I was just
    > too stupid to find them, or if there should indeed be a
    > standard (POSIX?) for providing such functionality. (One
    > call could be to determine if the filesystem supports those
    > operations fast - it could return a version for instance, 0
    > meaning that the operations, although provided, will be
    > slow.)
    >
    >
    > Thank you for any help! :-)


    You're asking that on unix newsgroups.
    The philosophy of unix is to be simple.

    For files, unix offers only a simple "sequence of bytes"
    abstraction, and lets applications implement anything more complex they
    want on top of this simple layer.


    Some older OSes offered more complex types of files, like sequential
    record files, indexed files, variable-length records, fixed-length records,
    etc. When the main peripherals were the card reader and the card punch,
    it looked like a logical way to organize files, as sequences of
    80-byte records... But these kinds of filesystems didn't survive, in
    part because they made both the system and the applications more
    complex.


    But this doesn't prevent us from implementing these kinds of files in unix
    if they're needed. For example, libdb (see dbopen(3)) implements
    various kinds of indexed record files. With this kind of file, you
    could more easily insert or remove blocks in the middle of the file.
    And of course, if libdb doesn't offer the exact features you need,
    there are several other such libraries, and you can implement your
    own.
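
    To illustrate the record-file idea, here is a small sketch that
    uses Python's dbm.dumb module as a stand-in for libdb. It is not
    the dbopen(3) API, and the helper names (key_for, store_chunks,
    read_all) are invented for this example. Each chunk is a record
    under a numeric key, so dropping or adding a chunk in the middle
    touches only that record.

```python
import dbm.dumb


def key_for(n):
    """Zero-padded key, so byte-wise sorting matches numeric order."""
    return b"%012d" % n


def store_chunks(db, chunks, step=10):
    """Store chunks under keys spaced `step` apart, leaving room for inserts."""
    for i, chunk in enumerate(chunks, start=1):
        db[key_for(i * step)] = chunk


def read_all(db):
    """Return the stream: all records concatenated in key order."""
    return b"".join(db[k] for k in sorted(db.keys()))
```

    With a database opened via dbm.dumb.open(path, "c"), removing the
    second chunk is "del db[key_for(20)]" and inserting between the
    first and second is "db[key_for(15)] = data". The catch described
    below still applies: every program reading the recording would have
    to go through such a layer.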


    Of course, now the problem is to make the applications use these
    libraries, to structure their files in a meaningful way.

    The main problem in your question is that these DVB-S files
    probably have a structure, and if this structure doesn't correspond to the
    blocks of the file system, the ability to insert or move chunks
    underneath it won't help you. You have to know the file format.

    In particular, even if the file is structured in some kind of blocks,
    there's no reason why the transitions from movie to ad or from ad to
    movie should fall exactly on block boundaries. And there is no reason why the
    file should still be consistent after some blocks have been removed from
    the middle: the file format may specify some offsets or indices in the
    file, and removing some blocks would invalidate these offsets, rendering
    the whole file unusable.


    --
    __Pascal Bourguignon__ http://www.informatimago.com/

    NOTE: The most fundamental particles in this product are held
    together by a "gluing" force about which little is currently known
    and whose adhesive power can therefore not be permanently
    guaranteed.

  5. Re: inserting/deleting into/from the middle of large files?

    >having a DVB-S receiver running Linux (PPC) I recently found
    >myself wondering how to delete data from the middle of a
    >large file (stripping a recording of ads, for example).
    >Currently, the common way of doing this seems to be by
    >copying the file (leaving a part behind) and then deleting
    >the original. Of course, on a large file (say 12 GB) this
    >can take an eternity; also you'll run into trouble if the
    >filesystem is nearly full...
    >Is it me, or doesn't that make any sense?
    >With a block-oriented filesystem, operations like this
    >should only take an instant.
    >
    >So basically I'm looking for functions to:
    >- insert a chunk into a file
    >- delete a chunk from a file
    >- move a chunk from one file into another


    These are easy to do if you're willing to make them slow enough.

    insert a chunk of size N into a file:
    Copy everything from the insertion point to the end of the file to a
    point N bytes past the insertion point (copy back to front so the
    overlap is not destructive). Now write the new data starting
    at the insertion point.

    delete a chunk of size N from a file:
    Copy everything from the first byte to be kept after the deleted segment
    to the end of the file down to the start of the deleted segment (copy
    front to back so the overlap is not destructive). ftruncate() N bytes
    off the end of the file.

    move a chunk from one file into another:
    I think this is an insertion followed by a deletion.
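
    The two recipes above could be sketched like this. It is an
    illustration under simple assumptions (one writer, no crash
    safety), and the names delete_range, insert_range and CHUNK are
    chosen for this example. The direction of each copy loop is what
    keeps the overlap from being destructive.

```python
import os

CHUNK = 1 << 20  # copy buffer size in bytes (arbitrary choice)


def delete_range(path, offset, length, chunk=CHUNK):
    """Remove `length` bytes starting at `offset`, shifting the tail down."""
    with open(path, "r+b") as f:
        size = os.fstat(f.fileno()).st_size
        if offset < 0 or length < 0 or offset + length > size:
            raise ValueError("range lies outside the file")
        src, dst = offset + length, offset
        # Copy front to back: the destination is always below the source,
        # so bytes that are still needed are never overwritten.
        while src < size:
            f.seek(src)
            buf = f.read(min(chunk, size - src))
            f.seek(dst)
            f.write(buf)
            src += len(buf)
            dst += len(buf)
        f.truncate(size - length)


def insert_range(path, offset, data, chunk=CHUNK):
    """Insert `data` at `offset`, shifting the tail up by len(data)."""
    n = len(data)
    with open(path, "r+b") as f:
        size = os.fstat(f.fileno()).st_size
        if offset < 0 or offset > size:
            raise ValueError("offset lies outside the file")
        end = size
        # Copy back to front: move the last chunk of the tail first, so the
        # overlapping region is read before anything is written over it.
        while end > offset:
            start = max(offset, end - chunk)
            f.seek(start)
            buf = f.read(end - start)
            f.seek(start + n)
            f.write(buf)
            end = start
        f.seek(offset)
        f.write(data)
```

    Both functions still move every byte after the edit point, so on a
    12 GB recording they are slow; they just avoid needing space for a
    second copy.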



  6. Re: inserting/deleting into/from the middle of large files?

    On 5 jul, 13:58, LC wrote:
    > 'lo there, =)
    >
    > having a DVB-S receiver running Linux (PPC) I recently found
    > myself wondering how to delete data from the middle of a
    > large file (stripping a recording of ads, for example).


    When removing _multiple_ sections from a video recording,
    the video editor just makes a list with the start and end points of these
    sections.

    In preview, the list is executed (playback runs through the sections
    specified).
    As there can be _many_ sections, this has some huge advantages.
    There are other issues, such as snapping each splice point to the exact
    mpeg2 frame boundary, but this is transparent to the user.
    A typical example that works this way is 'lve' (Linux Video Editor);
    it just creates an edit list.
    Only when all edits have been done (fast in a GUI) is the actual
    final output file created, made up of all the pieces you selected.
    So your problem is no problem.
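
    The final step of such an edit list might look like the sketch
    below. This is not how 'lve' is implemented; render_edit_list is a
    hypothetical name, and the pieces are plain byte ranges, whereas a
    real editor would first snap them to mpeg2 frame boundaries.

```python
def render_edit_list(src_path, dst_path, pieces, chunk=1 << 20):
    """Write the selected (start, end) byte ranges of src to dst, in order.

    Returns the number of bytes written.
    """
    written = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for start, end in pieces:
            src.seek(start)
            remaining = end - start
            while remaining > 0:
                buf = src.read(min(chunk, remaining))
                if not buf:  # the piece runs past the end of the source
                    break
                dst.write(buf)
                remaining -= len(buf)
                written += len(buf)
    return written
```

    All the editing before this call only changes the list, which is
    why it is fast; the copy happens once, at the end.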


  7. Re: inserting/deleting into/from the middle of large files?

    >>... how to delete data from the middle of a large file ...

    >>So basically I'm looking for functions to:
    >>- insert a chunk into a file
    >>- delete a chunk from a file
    >>- move a chunk from one file into another


    > I think you could do it by mmap()ping the file and then using memmove()
    > to shift the part of the file after the chunk being inserted or deleted.
    > However, with a 12 GB file this would only work in a 64-bit OS.


    Using mmap + memmove [+ truncate] does save space in the filesystem.
    However: there is added CPU and memory time (fetch+store, cache misses,
    page faults) to perform the memmove(), the disk transfer burden is no less
    than a series of 'cp' or 'dd' commands, and using memmove is much more
    fragile in the face of power failure. If you avoid a journaling file
    system, then current commodity SATA drives can deliver 30 to 60 MB/s,
    so 12 GB in plus 12 GB out is around 8 to 15 minutes. This might be
    less than watching the commercials once. ;-)

    --


  8. Re: inserting/deleting into/from the middle of large files?

    Barry Margolin wrote:

    > I think you could do it by mmap()ping the file and then
    > using memmove() to shift the part of the file after the
    > chunk being inserted or deleted. However, with a 12 GB file
    > this would only work in a 64-bit OS.


    Yes, but I doubt the OS or rather the FS will do anything
    other than the usual copy operation. You cannot remove
    something from the middle or insert into it this way (not
    without copying).

    ---
    Pascal Bourguignon wrote:

    > ... For example, libdb (see dbopen(3)) implements various
    > kinds of indexed record files. With this kind of file, you
    > could more easily insert or remove blocks in the middle of
    > the file. And of course, if libdb doesn't offer the exact
    > features you need, there are several other such libraries,
    > and you can implement your own.
    >
    > Of course, now the problem is to make the applications use
    > these libraries, to structure their files in a meaningful
    > way.


    Exactly. Those receivers usually have only limited
    processing power. Doing many things at once, what helps a
    great deal is simply passing on the MPEG2 data stream coming
    from the satellite/cable to the harddisk. Anyhow, one would
    have to modify the entire system and the data format. All
    programs relying on the current data format would therefore
    cease to function...


    > The main problem in your question is that these DVB-S files
    > probably have a structure, and if this structure doesn't
    > correspond to the blocks of the file system, the ability to
    > insert or move chunks underneath it won't help you.
    > You have to know the file format.
    >
    > In particular, even if the file is structured in some kind
    > of blocks, there's no reason why the transitions from movie
    > to ad or from ad to movie should fall exactly on block
    > boundaries. And there is no reason why the file should still
    > be consistent after some blocks have been removed from the
    > middle: the file format may specify some offsets or indices
    > in the file, and removing some blocks would invalidate these
    > offsets, rendering the whole file unusable.


    In the DVB case, it's a stream. Being able to "operate" on a
    per-block basis would already help.
    The filesystems, however, also have means of dealing with
    files smaller than the actual block size. I doubt that they
    currently have the means to have blocks with smaller content
    IN BETWEEN the chains. However, I cannot see why implementing
    it shouldn't be possible. If done, one could use the
    functionality transparently - regardless of file type.
    Many applications could benefit from this...


    pantel...@yahoo.com wrote:

    > When removing _multiple_ sections from a video recording, the
    > video editor just makes a list with start and endpoints of
    > these sections.
    > ... Only when all edits have been done (fast
    > in a GUI) is the actual final output file created made up of
    > all the pieces you selected. So your problem is no problem.


    Yes, I'm very much aware of that. The actual problem is
    nevertheless the "final stage". In case of my receiver there
    is a script performing those operations. It is supposed to
    be added to cron and run at nighttime (say 4 o'clock) as the
    finalizing part can take several hours of copying on a slow
    box. If there were support by the FS the same job could be
    done in a few (milli)seconds. That is the problem... ;-)


    Knowing what I know now (i.e., the functionality isn't there
    yet) - thanks to you folks! - my question is probably better
    placed in a group dealing with the actual implementation of
    a filesystem such as ext3.

    Again, thanks for your help! =)

    Regards,
    LC (myLC@gmx.de)


  10. Re: inserting/deleting into/from the middle of large files?


    Sorry 'bout the double post - **** Google's new beta interface! :-P



  11. Re: inserting/deleting into/from the middle of large files?

    On a sunny day (Fri, 06 Jul 2007 11:38:02 -0700) it happened LC
    wrote in <1183747082.651425.12440@n60g2000hse.googlegroups.com>:

    >pantel...@yahoo.com wrote:
    >
    >> When removing _multiple_ sections from a video recording, the
    >> video editor just makes a list with start and endpoints of
    >> these sections.
    >> ... Only when all edits have been done (fast
    >> in a GUI) is the actual final output file created made up of
    >> all the pieces you selected. So your problem is no problem.

    >
    >Yes, I'm very much aware of that. The actual problem is
    >nevertheless the "final stage". In case of my receiver there
    >is a script performing those operations. It is supposed to
    >be added to cron and run at nighttime (say 4 o'clock) as the
    >finalizing part can take several hours of copying on a slow
    >box. If there were support by the FS the same job could be
    >done in a few (milli)seconds. That is the problem... ;-)


    Well I dunno, I have been recording digital sat DVB-S for many
    years, starting on a AMD K6, with a SkyStar1 PCI card with hardware
    mpeg2 decoder....

    DVB-S TV is about 2GB max per hour, sometimes much less..
    'The actual problem' is that you first need to understand the transport
    stream format, then the contents of it (mp2 sound, AC3 sound, mpeg2 video),
    and how to cut those streams (sound does not keep the same pace as video).
    Even on something as ancient as a K6, 'copying' takes just some minutes;
    it only depends on the harddisk speed.
    From your posting I do not get the impression that you work with HD material
    (about 10GB/hour).

    There is something else about removing ads too.
    I have stopped doing it, because while editing I saw those ads many, many times
    over, much more than when just fast-forwarding the movie.
    There are always issues with sound - video sync too when editing, so better
    forget about it, just use fast-forward in playback.

    I would leave the filesystems in one piece, they are really good.
    These days I go even one step further, if I have a movie I want to keep,
    say 2 hours or 4 GB, I just grab a DVD+R, and do this with the recorded .ts transport
    stream:
    growisofs -speed 16 -Z /dev/dvd=my_recording.ts
    So now the DVD is an _image_ of the recording.
    No authoring, no filesystem limits, no filesystem!!!, and play back like this:
    cat /dev/dvd | mplayer -ao alsa:device=hw=1,0 -fs -cache 8192 -vop pp=0x20000 -

    Cannot beat this for speed and reliability and efficiency, as it allows
    4 700 000 000 bytes on a DVD, and no filesystem overhead.

    Those who want to sing about wrong use of cat please do it in the bathroom.

    EL Pante
    By using above methods you agree to the small print.


  12. Re: inserting/deleting into/from the middle of large files?

    On a sunny day (Fri, 06 Jul 2007 11:38:02 -0700) it happened LC
    wrote in <1183747082.651425.12440@n60g2000hse.googlegroups.com>:

    ---- replay previous text -------
    I would leave the filesystems in one piece, they are really good.
    These days I go even one step further, if I have a movie I want to keep,
    say 2 hours or 4 GB, I just grab a DVD+R, and do this with the recorded .ts transport
    stream:
    growisofs -speed 16 -Z /dev/dvd=my_recording.ts
    So now the DVD is an _image_ of the recording.
    No authoring, no file system limits, no filesystem!!!, and play back like this:
    cat /dev/dvd | mplayer -ao alsa:device=hw=1,0 -fs -cache 8192 -vop pp=0x20000 -

    Cannot beat this for speed and reliability and efficiency, as it allows
    4 700 000 000 bytes on a DVD, and no filesystem overhead.

    Those who want to sing about wrong use of cat please do it in the bathroom.
    ---- end previous text --------

    As a side note: why use this construct?

    Now suppose you have the .ts recording running all night long, from 20:00 to 05:00,
    just to get all the movies and check them out later. 9 hours is about 18 GB.

    This does not fit on a DVD+R, so quickly check roughly where the good stuff starts with
    xine recording.ts
    This gives you a time in minutes.

    But now how to extract and burn the right stuff to DVD?
    Say we have 1.8GB / hour; then we have 0.9GB / 30 minutes, or 90MB / 3 minutes, or 30 MB / minute.

    So now we can test where the good part starts (end is less important):
    dd if=recording.ts bs=30000000 skip=MINUTES | mplayer -ao alsa:device=hw=0,0 -fs -cache 8192 -vop pp=0x20000 -

    Just take a guess, and use successive approximation (say 10 tries) to quickly find
    the exact start, then burn the stuff to DVD:
    dd if=recording.ts bs=30000000 skip=START_MINUTES | growisofs -speed 16 -Z /dev/dvd=/dev/stdin
    It will stop when the DVD is full....

    You can use smaller granularity by reducing the 30000000.
    So now we can write a simple script.....
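
    Such a script might start from something like the sketch below,
    written in Python rather than dd. The constants come from the
    figures above (30 MB per minute, 4 700 000 000 bytes per DVD); the
    function names offset_for_minute and copy_from_minute are made up
    for this example, and the 30 MB/minute rate is only the rough
    estimate used here, not a property of every recording.

```python
BYTES_PER_MINUTE = 30_000_000  # rough estimate: 1.8 GB per hour
DVD_CAPACITY = 4_700_000_000   # bytes that fit on the DVD


def offset_for_minute(minute, bytes_per_minute=BYTES_PER_MINUTE):
    """Byte offset that dd reaches with bs=bytes_per_minute skip=minute."""
    return int(minute * bytes_per_minute)


def copy_from_minute(src_path, dst, minute, limit=DVD_CAPACITY,
                     bytes_per_minute=BYTES_PER_MINUTE, chunk=1 << 20):
    """Copy src into the writable binary object dst, starting at `minute`.

    Stops at end of file or after `limit` bytes; returns the bytes copied.
    """
    copied = 0
    with open(src_path, "rb") as src:
        src.seek(offset_for_minute(minute, bytes_per_minute))
        while copied < limit:
            buf = src.read(min(chunk, limit - copied))
            if not buf:  # end of the recording
                break
            dst.write(buf)
            copied += len(buf)
    return copied
```

    Passing a fractional minute (say 41.5) gives the smaller
    granularity mentioned above without changing the block size, and
    sys.stdout.buffer can serve as dst when piping into a player or
    burner.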



