Hi!

[Sorry if it is too late for SoC, but I was unexpectedly busy for the
last 3 days and couldn't finish this text earlier.]

This is a proposal of ideas and architectural changes for improving
ipfw. Some of them are independent of each other and could be
implemented in STABLE without breaking the ABI. Whether or not all of
them become SoC 2008 candidates, they should eventually be implemented
in FreeBSD. The only question is what should be corrected, so please
discuss it.

This text also includes slightly changed and/or generalized ideas from:
http://lists.freebsd.org/pipermail/f...il/002931.html

All syntax examples are only meant to give the idea; the actual syntax
should be discussed.

1. Major changes (ABI breakage is necessary).

1.1. Reorganizing dynamic rules.

Description:

Current ipfw dynamic rules are not suitable for several advanced
tricks. For example, it is not possible to use the saved information
about the current state of a connection elsewhere in the firewall
rules, nor is it possible to change that state from the firewall.

Wanted features:

* Ability to create/delete a dynamic rule in any state via some API or
ABI from all parts of the system: userland, ipfw rules, other kernel
modules. This can be useful for:

a) Creating a dynamic rule in the middle of a connection, not only at
setup:

ipfw add pipe 1 ip from any to any tagged 412 keep-state-middle

This allows changing the handling of a connection after some event,
e.g. L7 filtering by ng_bpf + ng_tag has discovered, by analyzing packet
payload, that a connection belongs to some class. From then on the
connection should be handled directly by dynamic rules and never be
sent to expensive L7 processing again.

Currently you can use plain "keep-state" for this, but ipfw will
not see the SYNs, so the rule will be subject to the sysctl
net.inet.ip.fw.dyn_rst_lifetime, which by default expires it after
1 second. That is undesirable in many cases.

b) Ability to save/load dynamic rules in userland files, e.g., to
continue after reboot.

c) Ability to exchange rule state with another machine running ipfw,
e.g., two firewalls in a CARP failover.

d) Creation of a rule with specified state and parameters before the
actual connection is established. E.g. imagine a by-default-closed
firewall with a netgraph(4) module analyzing the FTP control
connection and giving commands to ipfw to open dynamic "holes" for
data connections, thus eliminating the current practice of opening
ports in the entire range 1024-65535 (insecure, yes).

One can think about providing direct exchange between libalias(3)'s
alias_link and ipfw dynamic rules, but that's a subject for further
research.

* Additional fields in dynamic rules to keep arbitrary info for a
specific connection, and opcodes for loading and storing those values
from other parts of the firewall or elsewhere. This would allow
implementing pf(4)'s "scrub" maximum TTL enforcement on a connection,
but not only that: generic data storage allows any future extensions.

* Ability to change a dynamic rule's parent rule "on the fly" (just
changing the pointer to the static rule whose ACTION_PTR to jump to,
yes). This would allow the aforementioned distinguishing of connection
packets before/after L7 processing in the case where packets are always
classified into flows before any processing takes place. The example
with "keep-state-middle" assumed that the main firewall is stateless
and only L7-matched packets become dynamic. It also allows reassigning
the action of a dynamic rule:

...
ipfw add 100 check-state
...
ipfw add 200 skipto 500 ip from any to any keep-state
...
ipfw add 500 netgraph 41 ip from any to any
ipfw add 600 change-parent 800 ip from any to any tagged 412
ipfw add 700 allow ip from any to any
ipfw add 800 pipe 1 ip from any to any

* More types in the dynamic rules system, so that it is not limited to
"keep-state" and "limit" but is extensible to something more. E.g.,
current "limit" rules just drop packets when the limit is reached, but
the user may want an option to process them with another rule
afterwards.

Possible implementation:

* For arbitrary info: add a union of one uint32_t or two uint16_t's or
four uint8_t's to each dynamic rule, and operations to load/store
those values (or maybe a uint64_t and two uint32_t's and so on?..).
Also add one void* to allow storing more data if needed.

* Make a special netgraph node (or extend ng_ipfw) which will broadcast
every change in dynamic rules to all its hooks (how many changes to
bundle into one mbuf should be customizable). Every input containing
structs of the same format will result in the addition or deletion of
dynamic rules in ipfw. Working through a netgraph node provides a
flexible and extensible way to manipulate dynamic rules: you can
connect protocol trackers to it, which will insert rules for secondary
connections (e.g. FTP); you can connect a userland tool to it, which
will log all dynamic rule changes or load/save rules in a file; you can
connect an ng_ksocket(4) node to it, with UDP to broadcast to someone
or TCP to connect to another machine with the same setup to provide
CARP failover.

Note that the node should not do delivery/retransmission checks as
pfsync(4) does, because this is a task for someone else (to keep
modularity), but two such nodes on different machines connected to
each other should provide automatic rule synchronization without
additional actions after initial setup.
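
To make the two implementation points above more concrete, here is a
minimal userland sketch in C of what the per-rule storage union and a
fixed-format change record could look like. All names (dyn_extra,
dyn_event, dyn_event_bundle) and the record layout are hypothetical
and exist nowhere in ipfw; the record is IPv4-only for brevity, and
the bundling helper only mimics the "how many to bundle into one mbuf"
knob with a plain buffer.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical per-connection scratch storage; not an existing ipfw type. */
union dyn_extra {
	uint32_t u32;
	uint16_t u16[2];
	uint8_t  u8[4];
};

/* Hypothetical event codes for a change notification. */
enum dyn_event_type { DYN_EVENT_ADD = 1, DYN_EVENT_DEL = 2 };

/* Hypothetical fixed-size record emitted on every dynamic rule change. */
struct dyn_event {
	uint8_t  type;          /* one of enum dyn_event_type */
	uint8_t  proto;         /* IP protocol number */
	uint16_t parent_rule;   /* static rule whose action is used */
	uint32_t src_ip;        /* IPv4 only in this sketch */
	uint32_t dst_ip;
	uint16_t src_port;
	uint16_t dst_port;
	union dyn_extra extra;  /* arbitrary per-connection data */
};

/*
 * Copy as many whole records as fit into buf (at most n).
 * Returns the number of records copied.
 */
static size_t
dyn_event_bundle(void *buf, size_t buflen, const struct dyn_event *ev,
    size_t n)
{
	size_t fit = buflen / sizeof(*ev);

	if (fit > n)
		fit = n;
	memcpy(buf, ev, fit * sizeof(*ev));
	return (fit);
}
```

A receiver on another hook (or another machine) would walk the buffer
in sizeof(struct dyn_event) steps and add or delete rules according to
the type field. A real wire format would also need a version field and
a fixed byte order, which this sketch leaves out.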

1.2. Userland (and other subsystems) interaction, modularity, rulesets.

Description:

Currently /sbin/ipfw2 is a custom-made parser which communicates with
the kernel via setsockopt() calls. It is sometimes hard to extend with
new features due to complex code. Using a socket instead of a /dev
entry means you always need to be root (uid 0) both to read the
firewall configuration and to change it. The in-kernel protocol is also
sometimes hard to extend, while some additional whole-ruleset features
would be useful.

Wanted features:

* Parser's code (sbin/ipfw2.c) should be reviewed and possibly
rewritten using lex(1)/yacc(1). The syntax is complicated, however, and
it may not be possible to reproduce all of it exactly. This should be
further investigated.

* It may be desirable to give some other user the ability to at least
read the config, and maybe to write it, as /dev/bpf* permissions allow
for tcpdump(1).

* Device entry could also improve modularity: currently to add a new
IP_FW_* socket option, you have to modify netinet/raw_ip.c, which
means you can't just recompile /sbin/ipfw and ipfw kernel module.

* The same applies to other ipfw-related facilities: dummynet, divert,
NAT. It would be good to make them configurable by some means other
than tweaking raw_ip.c. It could also be useful to separate dummynet
and divert into their own facilities, so that they can be used without
ipfw, e.g., from netgraph(4). Related to this is a problem with IPSEC
interaction: if you use it with divert(4) on output, then on return
from divert packets will be IPSEC'ed again, because in ip_output()
IPSEC is called before pfil(9). It could be useful to add an option
for the user (in addition to the existing behaviour, to not break POLA)
to call IPSEC processing from a specified place in the ruleset, just
like any other action:

ipfw add ipsec ip from any to any out

* As a patch about using rule counters is currently being discussed in
ipfw@, it would be useful to add the ability to set rule counters to
arbitrary values rather than providing only the "zero" action. This is
closely related to the option of restoring ipfw's static ruleset
without losing counter values. Currently you can save "ipfw list" to a
file, run "awk '{print "add " $0}'" on it and then load it again (e.g.
after reboot). It must be possible to do the same with "ipfw show".
Syntax example for providing counters with "ipfw add", where all cases
are distinguishable by the number of leading numeric arguments (the
current syntax allows only the first two):

ipfw add allow ip from any to any # select next rule number
ipfw add 100 allow ip from any to any # exact rule number specified
ipfw add 1234 76845 allow ip from any to any # counters without rulenum
ipfw add 100 1234 76845 allow ip from any to any # rulenum and counters

* Static ruleset loading and saving is closely related to ruleset
precompilation and atomic commits. Imagine a ruleset with thousands
of rules: if a packet arrives in the middle of a ruleset update,
strange effects can occur. Of course, you can achieve the same
result with sets, by keeping the new set disabled and atomically
swapping the sets later, but that is not always comfortable.
Precompilation of the whole ruleset and then atomically installing it
("transaction commit") requires an implementation which would also
allow saving and loading the precompiled ruleset in binary form. That
is good for routers, where a 20K-rule script can take several minutes
to process.

* Precompiled binary rules can also be used for setting the same rules
from both other kernel subsystems and other machines (CARP again).
Thus, a generic binary rule format/protocol (not only for /dev) might
be invented. Moreover, the compiled ruleset format may differ from the
current linked list, which has two disadvantages: the initial "skipto"
(and the planned "call/return", see section 2.1) has to search the
list, and rules of disabled sets are still traversed. A precompiled
opcodes-only form allows quick jumps and easy cross-rule optimizations
(and even the possibility to compile ipfw opcodes to machine code, like
BPF_JITTER does for bpf(4), for more speed). Its disadvantages are that
rule counters have to be kept separately and that the user has a
not-so-transparent need to recompile every time, so this should be
further investigated.

* About several rulesets for different interfaces (or hacks like a
per-interface setting of the rule number to jump to): I think that
this is unnecessary and unfriendly to the user. Having one ruleset is
simpler, and you usually need common checks on packets. So "commit" of
precompiled rules, "call/return" actions (see section 2.1) and stack
virtualization via "vimage" should serve all practical purposes.
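
The claim above that the four "ipfw add" forms with optional counters
are distinguishable can be illustrated with a small C sketch. It is
not taken from ipfw2.c; the struct and function names are made up for
illustration, and it assumes that no action keyword ever consists of
digits only, so counting the leading all-digit arguments is enough.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical result of parsing the numeric prefix of "ipfw add". */
struct add_prefix {
	int      has_rulenum;
	int      has_counters;
	uint32_t rulenum;
	uint64_t pkts;
	uint64_t bytes;
	int      consumed;      /* leading numeric arguments used */
};

static int
is_number(const char *s)
{
	if (*s == '\0')
		return (0);
	for (; *s != '\0'; s++)
		if (!isdigit((unsigned char)*s))
			return (0);
	return (1);
}

/*
 * argv points just past the "add" keyword.
 *   0 numbers: automatic rule number, no counters
 *   1 number : rule number
 *   2 numbers: packet and byte counters
 *   3 numbers: rule number, then packet and byte counters
 * Returns 0 on success, -1 if more than three numbers lead the rule.
 */
static int
parse_add_prefix(int argc, char *const argv[], struct add_prefix *p)
{
	int i = 0, n = 0;

	while (n < argc && is_number(argv[n]))
		n++;
	if (n > 3)
		return (-1);
	*p = (struct add_prefix){ 0 };
	p->consumed = n;
	if (n == 1 || n == 3) {
		p->has_rulenum = 1;
		p->rulenum = (uint32_t)strtoul(argv[i++], NULL, 10);
	}
	if (n >= 2) {
		p->has_counters = 1;
		p->pkts = strtoull(argv[i++], NULL, 10);
		p->bytes = strtoull(argv[i++], NULL, 10);
	}
	return (0);
}
```

The rest of the rule (action and body) starts at argv[p->consumed], so
the existing parser code could continue from there unchanged.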

Possible implementation:

The general view is clear from the feature descriptions. One can also
think about a netgraph(4) node for this (again) and/or something like
shared memory pages between kernel and userland, to avoid allocating
memory in the kernel twice for big rulesets.
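
The "transaction commit" idea from the feature list can be sketched
as a single pointer swap. The following is a userland illustration
using C11 atomics, not kernel code: the names ruleset, ruleset_commit
and ruleset_current are hypothetical, and a real implementation would
use the kernel's own locking and must delay freeing the old ruleset
until no packet is still traversing it.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical precompiled ruleset; opcodes would follow the header. */
struct ruleset {
	size_t nrules;
};

/* The ruleset that packet processing currently uses. */
static _Atomic(struct ruleset *) active_ruleset;

/*
 * Install a fully built ruleset in one step. Returns the previous
 * ruleset so the caller can free it once it is no longer in use.
 */
static struct ruleset *
ruleset_commit(struct ruleset *fresh)
{
	return (atomic_exchange_explicit(&active_ruleset, fresh,
	    memory_order_acq_rel));
}

/* What the packet path would call once per packet. */
static struct ruleset *
ruleset_current(void)
{
	return (atomic_load_explicit(&active_ruleset,
	    memory_order_acquire));
}
```

A packet that fetched the pointer before the commit finishes its pass
on the old ruleset, and every later packet sees only the new one, so
no packet ever observes a half-updated ruleset.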

2. Independent (minor) changes, which may be possible without ABI breakage.

2.1. call/return rule actions.

Description of feature:

A "skipto" rule is known as a useful tool to optimize packet flow
through the ruleset, and it also makes it possible to assign several
actions to a dynamic rule (because a dynamic rule on match simply jumps
to the action part of its parent rule). But it can only jump forward,
not backwards, for the same reason as bpf(4) assembler instructions: to
prevent infinite loops in packet flow, which would hang the machine's
network operations. This can be addressed by introducing a pair of
instructions, call and return, which remember the position to return to
in a stack of some kind. Because return always goes to the next rule
after the calling one (by number, as with divert/skipto), infinite
loops cannot occur even if one rule is called many times: on stack
overflow, processing simply proceeds to the next rule.

Thus the call/return pair allows organizing a kind of subroutine, with
the trick that specifying an actual rule number lets you jump into the
middle of a subroutine, as in assembly language:

ipfw add 100 call 600 ip from any to any in recv $internal
ipfw add 100 call 700 ip from any to any in recv $external
...
ipfw add 500 allow ip from any to any
ipfw add 600 deny ip from any to any not antispoof
ipfw add 700 deny tcp from any to any 135,445
ipfw add 900 return // for both those calls

It should be noted again that returns are made by rule number, so in
the following example a return from the first "call 700" will pass
control to rule 301, not to the second rule 300.

ipfw add 300 call 700 ...
ipfw add 300 call 800 ...
ipfw add 301 count ip ...

Allowing "tablearg" in "call" would be very useful. The parser should
allow both versions of "return": with some conditions (the usual rule
body) and without them (like "check-state").

Possible implementation:

Relatively easy. On the first "call" for an mbuf, allocate an mbuf tag
holding a stack of uint16_t rule numbers and a stack top pointer. The
only things to take care of are divert etc. calls and distinguishing
input and output passes (the firewall can be called several times for
each), thus stack underflow and overflow should be carefully analyzed.
Maybe use two tag types, one for input and one for output.
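
A minimal C sketch of the stack that such an mbuf tag could carry is
shown below. The names and the depth limit of 16 are arbitrary
choices for illustration; mbuf tag allocation and the input/output
distinction are left out. Both functions return the lowest rule
number at which processing should continue.

```c
#include <stdint.h>

#define CALL_STACK_DEPTH 16     /* arbitrary illustrative limit */

/* Hypothetical payload of the per-packet tag. */
struct call_stack {
	uint16_t rulenum[CALL_STACK_DEPTH];     /* calling rule numbers */
	uint8_t  top;                           /* entries in use */
};

/*
 * "call target" executed by rule number caller. On stack overflow
 * the call degrades to falling through to the next rule number,
 * which is what rules out infinite loops.
 */
static uint32_t
do_call(struct call_stack *st, uint16_t caller, uint16_t target)
{
	if (st->top == CALL_STACK_DEPTH)
		return ((uint32_t)caller + 1);
	st->rulenum[st->top++] = caller;
	return (target);
}

/*
 * "return" executed by rule number self. Control goes to the rule
 * number after the calling one; on underflow (no call was made)
 * processing just continues after the return rule itself.
 */
static uint32_t
do_return(struct call_stack *st, uint16_t self)
{
	if (st->top == 0)
		return ((uint32_t)self + 1);
	return ((uint32_t)st->rulenum[--st->top] + 1);
}
```

With the rules from the example above, a call from rule 300 to rule
700 followed by "return" at rule 900 yields 301 as the place to
continue, regardless of how many rules share the number 300.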

It is difficult, however, to get this to perform well, because of the
linked-list nature of the ruleset and the inability to cache a pointer
to the jump destination, as is currently done with "skipto", because
there can be several return locations (even tablearg). A possible
solution may be to keep a cache of, say, 256 points in the list
(rulenum / 256) to reduce the search after such a point (effectively
equivalent to a hash on rulenum). Or to have compiled rulesets where
the offset to jump to is easily calculated (see section 1.2).
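
The 256-point cache could look roughly like the following C sketch.
The struct rule here is a stripped-down stand-in for the real rule
structure, and all names are invented for the example; the list is
assumed to be sorted by rule number, as the ipfw ruleset is.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a static rule in a list sorted by rule number. */
struct rule {
	uint16_t     rulenum;
	struct rule *next;
};

#define NBUCKETS 256

/* first[b] is the first rule with rulenum >= b * 256, or NULL. */
struct rule_index {
	struct rule *first[NBUCKETS];
};

/* Rebuild the index in one pass; needed after every ruleset change. */
static void
index_build(struct rule_index *ix, struct rule *head)
{
	struct rule *r = head;
	unsigned b;

	for (b = 0; b < NBUCKETS; b++) {
		while (r != NULL && (unsigned)(r->rulenum >> 8) < b)
			r = r->next;
		ix->first[b] = r;
	}
}

/* Find the first rule with rulenum >= num, or NULL if none exists. */
static struct rule *
rule_lookup(const struct rule_index *ix, uint32_t num)
{
	struct rule *r;

	if (num > UINT16_MAX)
		return (NULL);
	r = ix->first[num >> 8];
	while (r != NULL && r->rulenum < num)
		r = r->next;
	return (r);
}
```

The walk after the cached point is bounded by the number of rules
whose numbers fall into one 256-wide range, instead of by the length
of the whole ruleset.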

2.2. Tables and tableargs.

Tables are a very powerful way to both increase processing speed and
conveniently reduce the rule maintenance cost for the user, especially
with tableargs. Tables, however, are currently limited to IPv4
addresses/masks as keys and uint32_t's as values. Table keys should be
extended to other data types: IPv6 addresses, interface name strings:

ipfw add allow ip from any to table6(1) in recv stringtable(2)

or

ipfw add call tablearg ip from any to any via stringtable(3)

The latter would be very handy for routers with e.g. 2000 VLAN or ng*
interfaces, with a separate client and rules for each.

Tableargs should also be expanded to 16 bytes, to be able to store an
IPv6 address or a uint64_t for checking e.g. against rule counters. It
is questionable whether tableargs could also be short (< 16 bytes)
strings like interface names.
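
One possible shape for the extended keys and the 16-byte tablearg is
sketched below in C. Every name here is hypothetical; the interface
name buffer of 16 bytes is an assumption chosen to match the "< 16
bytes" string limit mentioned above.

```c
#include <stdint.h>
#include <string.h>

#define TABLE_IFNAMSIZ 16       /* assumed maximum, including the NUL */

/* Hypothetical key kinds for the generalized tables. */
enum table_key_type { KEY_IPV4, KEY_IPV6, KEY_IFNAME };

struct table_key {
	enum table_key_type type;
	uint8_t masklen;        /* prefix length; unused for KEY_IFNAME */
	union {
		uint32_t ip4;
		uint8_t  ip6[16];
		char     ifname[TABLE_IFNAMSIZ];
	} k;
};

/* Hypothetical 16-byte tablearg value. */
union table_value {
	uint32_t u32;           /* what tables store today */
	uint64_t u64;           /* e.g. a counter limit */
	uint8_t  ip6[16];
	char     str[16];       /* short string, NUL-terminated */
};

_Static_assert(sizeof(union table_value) == 16,
    "tablearg is expected to stay at 16 bytes");

/* Build an interface-name key; returns -1 if the name does not fit. */
static int
table_key_ifname(struct table_key *key, const char *name)
{
	size_t len = strlen(name);

	if (len == 0 || len >= TABLE_IFNAMSIZ)
		return (-1);
	memset(key, 0, sizeof(*key));
	key->type = KEY_IFNAME;
	memcpy(key->k.ifname, name, len);
	return (0);
}
```

Zero-filling the key before copying the name keeps keys comparable
with memcmp(), which matters if string tables end up being a hash or
a tree keyed on the raw bytes.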

Because it is difficult to distinguish whether an action parameter is
a valid value or a tablearg (you usually have only one invalid value
out of 65536, which gets assigned as the tablearg indicator), I suggest
adding operations like "settablearg", which would set the tablearg
without an actual table being used, e.g., from arbitrary info saved in
dynamic rules (see section 1.1) or even from the packet header. So
values for a "computed goto" or something like registers would still go
through tablearg (just generalizing the definition of a table), or at
least this should be so at the opcode level. The user could be presented
with some other keyword, but I don't see any point in hiding these
details.

The number of tables of all types should be configurable via sysctl or
at least a loader tunable rather than the current hardcoded number (128).

2.3. Time limit counter.

An opcode for a token bucket and/or leaky bucket should be introduced.
It would have one counter changed by a timer and another changed by
actual packets. We currently have the O_LOG opcode, which looks similar
to this, but O_LOG has nothing to do with a timer. The proposed opcode
would be useful at least for limiting the number of connections per
second, but any other possible use is appreciated, from the simplest
shaping without dummynet to more exotic ones like counter "price"
coefficients allowing to build in-kernel billing solely on ipfw
counters.
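
For reference, a token bucket of the kind meant here can be sketched
in a few lines of C. This is a generic illustration, not a proposed
opcode layout: the names are invented, time is passed in explicitly
in milliseconds, and instead of a separate timer the refill is
computed lazily from the elapsed time when a packet arrives. The cost
argument plays the role of the "price" coefficient.

```c
#include <stdint.h>

/* Hypothetical state that the opcode (or a table entry) would hold. */
struct token_bucket {
	uint64_t rate;          /* tokens added per second */
	uint64_t burst;         /* bucket capacity */
	uint64_t tokens;        /* current fill */
	uint64_t last_ms;       /* time accounted for so far */
};

static void
tb_init(struct token_bucket *tb, uint64_t rate, uint64_t burst,
    uint64_t now_ms)
{
	tb->rate = rate;
	tb->burst = burst;
	tb->tokens = burst;     /* start with a full bucket */
	tb->last_ms = now_ms;
}

/*
 * Try to take cost tokens at time now_ms (which must not go
 * backwards). Returns 1 if the packet is within the limit, else 0.
 */
static int
tb_consume(struct token_bucket *tb, uint64_t cost, uint64_t now_ms)
{
	uint64_t elapsed = now_ms - tb->last_ms;
	uint64_t add = elapsed * tb->rate / 1000;

	if (add > 0) {
		/* Advance only by the time turned into whole tokens. */
		tb->last_ms += add * 1000 / tb->rate;
		if (tb->tokens + add > tb->burst)
			tb->tokens = tb->burst;
		else
			tb->tokens += add;
	}
	if (tb->tokens < cost)
		return (0);
	tb->tokens -= cost;
	return (1);
}
```

For a "connections per second" limit, the rule would call tb_consume()
with cost 1 for each packet that sets up a connection and treat a zero
return as "limit exceeded".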

It is questionable where the counter values should be stored, with
regard to locking optimization: directly in the opcode, as with O_LOG,
or in a separately addressable space like tables.

2.4. Action rules and parameters.

Change ACTION_PTR handling in the kernel and its preparation in the
compiler to allow actions and their parameters to be placed in any
order (except for opcodes where order is required, e.g. prob). This
would easily allow placing several opcodes of the same type in the
action part, e.g.:

ipfw add count tag 1 tag 2 tag 3 ip from any to any

and using actions and their parameters interchangeably, like having
a rule without an actual action opcode (only a parameter instead), e.g.
using "tag" or "altq" as an action too (equivalent to "count").

2.5. Just to mention: modip, counter limits, fragments.

These patches are already being discussed in ipfw@, but are included
here just to not forget them. These are the "modip" action, allowing to
modify the IP header (DSCP, ToS, TTL), with corresponding match rule
options, and a rule option to match when rule counters are less than a
specified number of packets or bytes (possibly from a dynamic rule's
counters), which may be a tablearg. This is also related to the ability
to control rule counters mentioned in section 1.2.

Adding a few keywords to O_FRAG for matching more kinds of fragments
(not only non-first fragments), e.g. for sending them to a specialized
netgraph(4) reassembling module, is also desirable.


That's all for today. Any comments, additions, corrections are welcome!

--
WBR, Vadim Goncharov. ICQ#166852181 mailto:vadim_nuclight@mail.ru
[Moderator of RU.ANTI-ECOLOGY][FreeBSD][http://antigreen.org][LJ:/nuclight]
