3acd6a08d5add the true spider request when retrying to spider a fake-ip spider request. add an EFAKEIP error reply for the fake-ip request. prevents us from double spidering the same url.
Matt Wells
2013-12-23 10:27:42 -08:00
11d6d5ad6aMerge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2013-12-23 09:30:52 -08:00
2ac8ff2952compile regex so it's case dependent
Matt Wells
2013-12-23 09:30:35 -08:00
b0d77a834ado not spider fake-ip requests, just re-add them with the right firstip
Matt Wells
2013-12-20 12:22:02 -08:00
6f2e552bcdfix core in linked list of msg13requests in case one gets freed
Matt Wells
2013-12-20 11:26:46 -08:00
5fcfff6729fixes for spiders getting stuck
Matt Wells
2013-12-19 20:04:06 -07:00
4c7ce819b9fix core dump
Matt Wells
2013-12-19 18:39:29 -08:00
c2f8445a70expand reg ex shortcuts like \d to [0-9]
Matt Wells
2013-12-19 18:31:37 -08:00
261f4feb9bfixed cdata parsing issue
Matt Wells
2013-12-19 16:04:53 -08:00
3092dcecaarebuild url filters and regexes at startup
Matt Wells
2013-12-19 15:56:27 -08:00
99099505d8call regfree before changing regex
Matt Wells
2013-12-19 15:32:26 -08:00
7f70e4e887fix regex logic
Matt Wells
2013-12-19 15:19:18 -08:00
aad12f9fe3minor print format fix
Matt Wells
2013-12-19 14:30:56 -08:00
ef5decb0b8more fixing stuck spiders
Matt Wells
2013-12-19 14:17:22 -08:00
32db83ae47try to fix spiders from petering out. reset doledb next keys and empty flags every 3 minutes.
Matt Wells
2013-12-19 13:31:14 -08:00
cb111a1efafix doledb empty logic
Matt Wells
2013-12-19 13:06:35 -08:00
d2f9dcf8e0revert last commit. not needed
Matt Wells
2013-12-19 10:33:04 -07:00
784b6900cdmore spider fixes
Matt Wells
2013-12-19 10:29:01 -07:00
d5f63888a3Merge branch 'master' of git@github.com:gigablast/open-source-search-engine
Matt Wells
2013-12-19 10:15:05 -07:00
56461ee795fix spidering getting stuck bug
Matt Wells
2013-12-19 10:14:50 -07:00
a440e1cbf5update admin link on root page and documentation for url filters
mwells
2013-12-18 19:51:50 -07:00
a0ceade641fix oom doleiptable using too much mem so bulk job went oom
Matt Wells
2013-12-18 17:20:53 -08:00
e93cfe8ac6Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2013-12-18 15:57:39 -08:00
58ce15a7f3fix big post of 70MB of urls
Matt Wells
2013-12-18 15:57:10 -08:00
894ced5b08Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2013-12-18 15:29:54 -08:00
356836812dwtf i did not modify these files.
Matt Wells
2013-12-18 15:26:50 -08:00
60dddfc669final fixes for parms
Matt Wells
2013-12-18 15:22:54 -08:00
6f0137889bfixes for getUrlFilterNum so it looks at "hadReply" bit in SpiderRequest when getting diffbot api url.
Matt Wells
2013-12-18 14:05:41 -08:00
7170c4f0earebuild url filters was not getting called when some relevant parms were updated.
Matt Wells
2013-12-18 13:24:38 -08:00
1b5057ad42log cleanups mostly. took out disk page cache, kinda buggy... need to fix at some point.
Matt Wells
2013-12-18 10:57:18 -08:00
2ffad5d835fix cores
Matt Wells
2013-12-17 17:35:07 -08:00
2f2333abd1parmdb fixes
Matt Wells
2013-12-17 14:53:33 -08:00
33ba8070b5more bug fixes parmdb
Matt Wells
2013-12-17 13:09:05 -08:00
31e16c972dfix restart crawl
Matt Wells
2013-12-17 11:17:33 -08:00
39a0b7f85eparm updates
Matt Wells
2013-12-17 10:53:12 -08:00
d03028ea93bulk api post truncation fix
Matt Wells
2013-12-17 10:03:46 -08:00
2cd53386adparm updates
Matt Wells
2013-12-17 09:51:08 -08:00
523d32a2eaparmdb updates
Matt Wells
2013-12-16 18:13:38 -08:00
ad4a4415d0fix pauseCrawl
Matt Wells
2013-12-16 17:21:59 -08:00
3f19ece776parmdb updates
Matt Wells
2013-12-16 17:07:15 -08:00
617a0ff76eparmdb fixes
Matt Wells
2013-12-16 16:04:43 -08:00
6c652c1cc6more parmdb fixes
Matt Wells
2013-12-16 15:39:24 -08:00
9a65febd9eparmdb updates
Matt Wells
2013-12-16 14:35:27 -08:00
1fe91cad2fparmdb updates
Matt Wells
2013-12-16 14:10:39 -08:00
9b080ff89cmore parmdb bug fixes
Matt Wells
2013-12-16 13:36:31 -08:00
9be1ab6323more parmdb fixes
Matt Wells
2013-12-16 12:20:13 -08:00
a727fb10e6parmdb fixes for checkboxes. use radio buttons.
Matt Wells
2013-12-16 11:41:43 -08:00
9cb99f7621Merge branch 'diffbot' into diffbot-testing
Matt Wells
2013-12-16 11:06:11 -08:00
0615acff17zero out url filters checkboxes on submit
Matt Wells
2013-12-16 11:03:40 -08:00
2b10a3327dMerge branch 'master' into diffbot
Matt Wells
2013-12-16 10:49:40 -08:00
22eb06e54da few bugfixes imported from neo github subdir
Matt Wells
2013-12-16 10:49:13 -08:00
660f43cec7fix bugs of pthreads junk not being async safe. we were calling fprintf from a signal handler (interrupt) while fprintf was currently in progress and the pthread junk did not like that.
Matt Wells
2013-12-15 11:41:41 -07:00
06f67db16bforgot to unlock thread lock
Matt Wells
2013-12-15 10:43:34 -07:00
0d976e5a7finclude pthread.h
Matt Wells
2013-12-15 10:40:50 -07:00
7cad5df43etry to fix core from pthreads logging msgs.
Matt Wells
2013-12-15 10:38:18 -07:00
39c8a9a1d7Merge branch 'master' of git@github.com:gigablast/open-source-search-engine
Matt Wells
2013-12-14 15:20:04 -07:00
f9f73dae65fixed core from null json
Matt Wells
2013-12-14 15:19:52 -07:00
4fdc781a27fix spiders sticking when coll is immediately deleted after seeding a url.
Matt Wells
2013-12-14 10:52:41 -07:00
777dfb9713fix round from incrementing while spiders out.
Matt Wells
2013-12-13 20:14:34 -07:00
b7a96a0a1dminor update
Matt Wells
2013-12-13 19:47:11 -07:00
a8d03b0634more parmdb bug fixes
Matt Wells
2013-12-12 13:57:19 -08:00
463e6ced54Merge branch 'diffbot' into diffbot-testing
Matt Wells
2013-12-12 13:02:50 -08:00
abcbfa7a60Merge branch 'master' into diffbot
Matt Wells
2013-12-12 13:02:12 -08:00
7b768d4b86Merge branch 'diffbot' into diffbot-testing
Matt Wells
2013-12-12 13:01:49 -08:00
16e91375f4bring in changes from live beta from ~/github. limit spiders to 50, not 500 to prevent oom. resume killed merges that had num files shrunk even if down to one file. show collnum in spider queue. remove back-to-back whitespace, and make all space a ' ' for getting the doc checksum for deduping.
Matt Wells
2013-12-12 12:58:58 -08:00
33d4b92544Merge branch 'diffbot' into diffbot-testing
Matt Wells
2013-12-12 12:51:43 -08:00
a13114605amore parm overhaul fixes
Matt Wells
2013-12-12 12:44:54 -08:00
d85dbfb8e7do not use safebuf in thread
Matt Wells
2013-12-12 10:15:02 -07:00
76bb3d05e1clean up logging so i can see what's going on
mwells
2013-12-10 16:41:30 -08:00
db74af766bfix core in addExistingColl()
mwells
2013-12-10 15:46:38 -08:00
82494baa89move CollectionRec stuff into Collectiondb files for simplicity.
mwells
2013-12-10 15:28:04 -08:00
14b0682d6bcan't use safebuf in a thread. oops!
Matt Wells
2013-12-10 14:20:44 -07:00
22271c0bb2do not accept msg4 add requests until in sync with host 0
mwells
2013-12-10 13:20:23 -08:00
f2d5661965parmdb overhaul. support collection add/del sync when host comes back online. use udp not tcp. host #0 can now handle a new incoming request while a parm change is currently outstanding. all missed "command" parms will be received when a dead host comes back online, too, like a tight merge for instance. does not use msg4, uses msg3e and msg3f for syncing and sending parms.
mwells
2013-12-10 13:09:55 -08:00
1175478705got this new parm shit compiling
mwells
2013-12-10 12:54:19 -08:00
9e1976a8e2new parm stuff almost compiling.
mwells
2013-12-10 11:13:43 -08:00
6f6c4aed84minor admin.html edit.
Matt Wells
2013-12-10 10:39:38 -07:00
1a7d5e389bvery minor admin.html edit
Matt Wells
2013-12-10 00:56:56 -07:00
ec2254d8edadded multi language support note to admin.html
Matt Wells
2013-12-09 23:18:33 -07:00
f7e7acb398minor log msg updates. updated admin.html to give some performance and storage capacity info.
Matt Wells
2013-12-09 23:16:24 -07:00
95bd6238d9do not core when running filters when our gb home dir is really long. thanks bill! call XmlDoc::getSpiderPriority() with a SpiderReply so we can act on m_langId, like chinese, for instance, to filter those langs out from indexing. it was doing this before but got commented out for some reason.
mwells
2013-12-09 22:55:02 -07:00
cc63fd048fMerge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
mwells
2013-12-09 13:46:08 -08:00
2a5d4beec4fix core from last push.
Matt Wells
2013-12-09 14:21:46 -07:00
fa497de217remove annoying log msg
Matt Wells
2013-12-09 14:09:48 -07:00
44ae7c4de6mem labelling fixes. fixed bad alloc when generating gigabits.
Matt Wells
2013-12-09 14:05:02 -07:00
0dcd1211d3new opensource icon.
Matt Wells
2013-12-08 19:47:39 -07:00
92ec3f1148added open source icon to homepage
Matt Wells
2013-12-08 19:45:49 -07:00
92e3d841a6minor update
Matt Wells
2013-12-08 19:28:45 -07:00
12404b4f85doc updates
Matt Wells
2013-12-08 19:26:48 -07:00
dd3b49faa9collection name hell
Matt Wells
2013-12-08 16:44:37 -07:00
3353a90a85fix resuming a killed merge condition.
Matt Wells
2013-12-08 15:50:45 -07:00
ed79b67d2ecore dump fixes
Matt Wells
2013-12-08 15:36:23 -07:00
144e2c898esave resources by not doing reads on an empty doledb priority. stop saving allSpidersOn and Off parms.
Matt Wells
2013-12-08 14:07:31 -07:00
a2e52a5dc3little fix
Matt Wells
2013-12-08 10:15:54 -07:00
020d7741b9new coll.conf for main with ismedia filter. updated url filters docs some more for "isnew" and explained the errorcount stuff more.
Matt Wells
2013-12-08 10:10:51 -07:00
65e75167e3limit posdb merging to 8 files max. added some more url filters documentation.
Matt Wells
2013-12-08 09:41:05 -07:00