Bring kernel event notification in DragonFly BSD to its logical conclusion

Samuel J. Greear

Short description: The 2010 GSoC project that revamped kqueue/(2)kevent(2) and reimplemented select(2) and poll(2) on top of them was a success, but a number of issues and opportunities have been recognized since that time. This project aims to address these remaining issues and fully round out the kernel event notification mechanism in DragonFly BSD.

Name: Samuel J. Greear
Email: sjg@thesjg.com
Phone number (include country and area code): +1 (605) 569-1204

Project title: Bring kernel event notification in DragonFly BSD to its logical conclusion

Description of project goals, including details of how the delivered code will operate:
During my 2010 GSoC Project, I re-implemented the select(2) and poll(2) kernel interfaces on top of the kevent(2) interface. This project went mostly as-expected, but having had some time to settle a number of issues and additional opportunities have arisen. This project aims to bring the rework of the DragonFly kernel file descriptor event subsystems to its logical conclusion, fix any lingering bugs and generally put the code in maintainable and portable shape.

The first opportunity is to restructure the knote structure (kernel struct knote). Currently, when select(2), poll(2) or kevent(2) registers an interest in read, write or exceptional conditions (filters) on a descriptor the internal kqueue_register() function is called which looks up the requested knote based on the file descriptor and the filter type.

struct knote {
    SLIST_ENTRY(knote) kn_link; /* for fd */
    TAILQ_ENTRY(knote) kn_kqlink; /* for kq_knlist */
    SLIST_ENTRY(knote) kn_next; /* for struct kqinfo */
    TAILQ_ENTRY(knote) kn_tqe; /* for kq_head */
    struct kqueue *kn_kq; /* which queue we are on */
    struct kevent kn_kevent;
    int kn_status;
    int kn_sfflags; /* saved filter flags */
    intptr_t kn_sdata; /* saved data field */
    union {
        struct file *p_fp; /* file data pointer */
        struct proc *p_proc; /* proc pointer */
    } kn_ptr;
    struct filterops *kn_fop;
    caddr_t kn_hook;
};

Doing things in this fashion is perfectly acceptable for the kevent(2) interface, as read, write and exceptional conditions are logically separated. For the poll(2), select(2) and future epoll(2)/devpoll(2) interfaces it is problematic. For these interfaces the conditions are more closely tied to the descriptor. The code implementing select(2) and poll(2) at present must do a lot of extra work to return combined filter results. The epoll(2) interface would be extremely problematic to implement in this manner, potentially requiring full scans of all registered knotes. The logical step here seems to be to restructure the knote structure in such a fashion that filters are grouped based on the file descriptor. Currently I am thinking something like the following:

/* kqueue, select, poll, epoll, devpoll (5) */
#define KEVENT_INTERFACES 5
SLIST_HEAD(klist, knote)
struct knote {
    SLIST_ENTRY(knote) kn_link; /* for fd */
    TAILQ_ENTRY(knote) kn_kqlink; /* for kq_knlist */
    SLIST_ENTRY(knote) kn_next; /* for struct kqinfo */
    TAILQ_ENTRY(knote) kn_tqe; /* for kq_head */
    struct kqueue *kn_kq;
    union {
        struct file *p_fp; /* file data pointer */
        struct proc *p_proc; /* proc pointer */
    } kn_ptr;
    struct knote_data *notes[KEVENT_INTERFACES];
}

struct knote_data {
    struct kevent kn_kevent;
    int kn_status;
    int kn_sfflags; /* saved filter flags */
    intptr_t sn_sdata; /* saved data field */
    struct filterops *kn_fop;
    caddr_t kn_hook;
}

In this or a similar structure would allow the other implementations (select(2)/poll(2)) to do a great deal less work, allow for the relatively easy implementation of epoll(2) and has virtually no consequence to the speed or robustness of the original kqueue implementation.

After restructuring kevent to function as described above and updating the existing select(2) and poll(2) implementations, I would like to implement epoll(2) replacing the current Linux implementation as well as /dev/poll. This would uniquely differentiate DragonFly as the only platform to support all of the major i/o multiplexing schemes natively.

The next most pressing issue with the kevent implementation as it stands in DragonFly (and the other BSD's as well) is one of robustness. The beauty of kevent is that it is stateful and stateful means fast. The drawback of this is that there are many lists and queues that must be precisely managed in order for things not to go awry. This is evidenced by, for example, the ums usb mouse detach bug: http://bugs.dragonflybsd.org/issue1873 As it stands, all kqueue/kevent implementations suffer from this problem.

The most reasonable solution, apart from going back to a stateless approach like the former kernel poll support, is to limit the direct list handling as much as possible and force kqueue filter implementers to do the right thing where possible. I would propose to extend devfs with a new API for filter implementers that manages most aspects of device bring up, teardown and knote management for all devices which feature a devfs device node. Reference: http://leaf.dragonflybsd.org/mailarchive/kernel/2011-02/msg00006.html

As a next step I would like to take this approach of centralized control of the lists and data structures even further by extending the setup and teardown API's offered to the filters. As a part of this I foresee making the various structures used by kqueue be allocated and fully managed by the kqueue subsystem, especially the destruction of any used objects should only be done after the kqueue subsystem is certain that there are no more consumers. Reference: http://leaf.dragonflybsd.org/mailarchive/kernel/2011-02/msg00008.html

There are several open bug reports that are obviously problems with the current kqueue implementation or the select/poll wrappers. Once the bulk of the above work is complete I believe it would be prudent to both attempt to reproduce these bugs to see if they still exist and attempt to fix them if at all possible. Additional testing should also be done to ensure that no regressions were introduced and that the epoll and /dev/poll interfaces work as-expected with software that implements support for them. References: http://bugs.dragonflybsd.org/issue1730 http://bugs.dragonflybsd.org/issue1998 http://bugs.dragonflybsd.org/issue2011 http://bugs.dragonflybsd.org/issue2028

Finally, I would like to document the finalized filter implementation API and a porting guide for kernel developers, to ease bringing in drivers from other platforms.


Project timeline broken down by week, with details on when each feature described in the project goals will be available.
Community Bonding Period:
- Discuss expectations with my assigned mentor.
- Help other students get settled in with their projects.
- Setup a page or wiki and a place for branches to document my progress.
- Setup infrastructure required to support my development.
- Research and get a head start on my project.
Week 1 (May 23 - May 29):
- Validate that my assumptions will work and rework the kevent internals with a split knote structure.
Week 2 (May 30 - June 5):
- Finish implementing the split knote structure and begin re-implementing select(2) and poll(2).
Week 3 (June 6 - June 12):
- Finish select(2) and poll(2) and ensure everything still works like it should.
Week 4 (June 13 - June 19):
- Implement epoll(2) and test
Week 5 (June 20 - June 26):
- Implement /dev/poll and test
Week 6 (June 27 - July 3):
- Finish implementing epoll and devpoll and testing
- Design and begin implementing new devfs kevent API
Week 7 (July 4 - July 10):
- Finish implementing devfs API and conversion of all devices
- Lots of testing
Week 8 (July 11 - July 17):
- Mid-term week
- Ensure that all code completed to this point is cleaned up and committed so that my mentor may review it and perform a fair mid-term review.
- devfs api work possibly committed this week
Week 9 (July 18 - July 24):
- devfs API work finished, tested and commited
- Design and begin implementing new setup/teardown API's
Week 10 (July 25 - July 31):
- Finish implementing new setup/teardown API's and testing
- Testing, testing
- Fix bugs wherever possible based on application tests
Week 11 (August 1 - August 7):
- Final cleanup and commits
- Chase bugs
- Project completed and submitted to my mentor for final feedback and approval.
Week 12 (August 8 - August 14):
- Not near a computer, in Colorado, US acclimating to the elevation and competing in the Leadville 100 MTB race.


What additional resources as hardware or software will be needed for this project?
- Between leaf and my own hardware assets I should not require anything additional.

Please include or provide links to prior code related to this area of work.
- DragonFly committer and GSoC 2010 contributor, this should not be necessary.