Commit Graph

  • 3191980f49 the new urls.csv format is ready. added url discovered time to gbssdocs so we know when we first found a url. also added to new urls.csv. fixed spiderdb list deduping so as not to discard the oldest spider request any more so we keep our discovered time intact. Matt 2015-04-15 12:13:27 -06:00
  • f0f8f0a967 Merge branch 'diffbot' into diffbot-testing Matt Wells 2015-04-14 16:27:28 -06:00
  • 0c88ebba9b removed buggy close least used linked list logic. was causing data corruption in reads and writes. go to urgent shutdown mode if on 10th try so gb will actually exit. do not startup if there is critical data corruption. Matt Wells 2015-04-14 15:26:46 -07:00
  • 61af961dfd use m_sentToDiffbotThisTime in SpiderReply now too Matt Wells 2015-04-14 15:23:12 -07:00
  • 99454bc8ca added gbssSentToDiffbotThisTime and gbssSentToDiffbotAtSomeTime to gbss docs to clarify if the url was sent to diffbot at this crawl time, or any time. makes it easier to see what is getting processed this crawl round. Matt Wells 2015-04-14 14:50:39 -07:00
  • a92b158bc7 gbss doc oopsy fixes Matt 2015-04-13 16:47:41 -06:00
  • 040a604ec6 xmldoc back to o2 Matt 2015-04-13 14:49:44 -06:00
  • 497131d359 fix gbssdocid bug better Matt 2015-04-13 14:33:57 -06:00
  • 3e5218c54c fix gbssDocId:123456789, et al, query. will only work for docs indexed after applying this fix. Matt 2015-04-13 14:13:16 -06:00
  • 31ac1fa2b0 quick fix for spider status Matt Wells 2015-04-13 12:16:55 -07:00
  • 8d1e67be0a the diffbot objects we index as their own separate doc should inherit the hopcount from the parent html doc. Matt Wells 2015-04-13 12:12:43 -07:00
  • 4a32c8308e Merge branch 'diffbot-testing' into diffbot Matt Wells 2015-04-13 12:08:06 -07:00
  • 614e9215cd update getSpiderStatusMsg() to always set *status. always show diffbotreply when doing crawlbottesting Matt Wells 2015-04-13 12:06:22 -07:00
  • 9f836dbf75 fix corruption of s_vbuf (gb version) in the hosts table. Matt 2015-04-13 11:13:44 -06:00
  • 4a43e1387e better fixes for core from sig alarms Matt 2015-04-13 10:28:43 -06:00
  • 48ac8bf80f fix udp linked list thing again Matt 2015-04-13 10:13:59 -06:00
  • e6696a6937 fix some more Matt 2015-04-13 10:08:01 -06:00
  • 9feb070fe9 fix issue of not being able to exit gb when a disk read retry is taking forever. Matt 2015-04-13 10:06:08 -06:00
  • f5a7423336 fix bug of never calling callback Matt 2015-04-13 09:56:21 -06:00
  • 47ed2a57ee update log msg Matt Wells 2015-04-13 07:49:57 -07:00
  • 43ced700d0 calls NEWS BLOG Matt 2015-04-12 12:33:09 -06:00
  • 2814e3db37 show screenshots Matt 2015-04-12 11:52:22 -06:00
  • 994ba73007 log more when doing crawlbottesting-* tests Matt Wells 2015-04-12 06:46:33 -07:00
  • 56a46fd294 fix printing of facets when &header=0 so diffbot json output is still simple and correct. mwells 2015-04-10 16:25:38 -06:00
  • a891fb7bdc turn on indexing spider status docs for all diffbot CRAWLS on startup, whether it was off or on before. Matt Wells 2015-04-10 14:35:41 -07:00
  • 4dce44c976 fix log msg Matt Wells 2015-04-10 13:31:46 -07:00
  • 02aa138fbb debug helper msg Matt Wells 2015-04-10 13:27:06 -07:00
  • 13d0361756 try to speed up host #4 on seraph Matt 2015-04-10 09:20:18 -06:00
  • 6b139e9eee clarify jam ups Matt Wells 2015-04-08 18:30:27 -07:00
  • e38cb8c080 Merge branch 'diffbot' into diffbot-testing Matt Wells 2015-04-08 18:08:10 -07:00
  • a7e48515fa fix core when adding gbss for a force deleted doc. Matt Wells 2015-04-08 18:07:32 -07:00
  • 64bae224e0 fix core on the GI Matt Wells 2015-04-08 16:05:32 -06:00
  • 97d3b185c1 just use INCOMING udp slots/sockets for jam detection. this will highlight the slow nodes better. Matt Wells 2015-04-08 15:52:43 -06:00
  • 7fd4310106 don't include gbss headers if not gbss documents Matt Wells 2015-04-07 22:05:11 -07:00
  • fea5210906 fix infinite loop bug mwells 2015-04-07 15:27:26 -06:00
  • 2997b3bb28 fix for skipping dead shards on tag rec lookup mwells 2015-04-07 14:36:47 -06:00
  • 53a2d39afd fix for calling callback of timedout udp slots mwells 2015-04-07 14:18:42 -06:00
  • 2114c40cda fix not calling callback when udp reply times out. like for msg39 replies we need to timeout quickly. Matt Wells 2015-04-07 12:38:35 -07:00
  • 05a66cc367 fix bug of not able to get ip address because peeksize is too big. Matt Wells 2015-04-07 12:29:19 -07:00
  • b08d12a11e fix cores associated with new spider status docs. Matt Wells 2015-04-07 10:33:54 -07:00
  • 4ed8231222 up max vfds again Matt Wells 2015-04-07 09:56:04 -06:00
  • 036fb4e0dc crap, another oopsy fix Matt 2015-04-06 14:43:58 -07:00
  • bc3335c434 more facet counting fixes Matt 2015-04-06 14:38:38 -07:00
  • 74fe3c5866 more facet counting fixes Matt 2015-04-06 14:02:59 -07:00
  • 1a262c8254 fixed oopsy Matt 2015-04-06 13:51:19 -07:00
  • 8326460e8f fix counting of # docs that have facet field. Matt Wells 2015-04-06 14:41:44 -06:00
  • 330d9a9dbf report max rounds reached, not max to process or crawl reached. Matt Wells 2015-04-06 10:24:33 -06:00
  • bffaa09599 fix for the GI Matt 2015-04-06 08:24:00 -06:00
  • de187dbb2b documentation fix Matt 2015-04-03 16:00:04 -06:00
  • 8433c49aa9 make sure we index a spider status doc for each diffbot object. that way we can tell if diffbot objects are deduping, how they are changing over time, etc. Matt 2015-04-03 14:59:09 -06:00
  • dad1cb15f4 fix excessive looping when calling makeCallbacks() on niceness 1 or above when none are available. Matt 2015-04-03 12:12:58 -06:00
  • c991a2dcdd try to ameliorate the udp slot jamming issue. Matt 2015-04-03 10:43:11 -06:00
  • c2567ad244 a hopeful fix for host #0 always crashing from streaming socket timeouts. Matt 2015-04-02 15:17:49 -06:00
  • 2ce107e4be keep track of how many times the host exited/cored as an exponent to the 'x' in the hosts table. this way we can detect hosts that have restarted many times and fix them. Matt 2015-04-01 16:28:58 -06:00
  • e583850e40 fix core when searching bogus collection. Matt 2015-04-01 15:30:59 -06:00
  • 94a8210586 added CSV to output dropdown. show all json fields for spider status doc csv files. support spider status docs in csv output. Matt 2015-04-01 13:53:03 -06:00
  • f26c9d609b one more qa test fix for spider status docs Matt 2015-04-01 12:47:32 -06:00
  • 5e46262cb2 more fixes for qa'ing of new spider status docs Matt 2015-04-01 12:03:17 -06:00
  • 10a31783bb fixes to pass internal qa tests in light of gbss (spider status doc) changes and other things. had to make xmldoc.o -O2 instead of -O3 to fix strange bug. Matt 2015-04-01 11:20:36 -06:00
  • 6b293f17e6 now show "totalDocsWithField" for each facet, so we know how many docs had that field, with any particular value, so we can do tf/idf type things. Matt 2015-04-01 09:16:42 -06:00
  • 47f6d9f414 clean out rebuild trees/buckets too mwells 2015-03-21 22:42:49 -06:00
  • e99b2f0a65 added RdbBuckets::cleanBuckets() corresponding to RdbTree::cleanTree() to remove keys from deleted collections at startup. Matt 2015-03-21 22:28:34 -06:00
  • 000c5d67e9 do not index xml docs' body for custom crawls or when indexbody is turned off. Matt 2015-03-21 09:21:44 -06:00
  • 9f42a6d5ff fix indexing of spider status docs Matt Wells 2015-03-20 18:08:39 -07:00
  • 7d82a5ca69 try to get diffbot reply info first before making spider status doc Matt Wells 2015-03-20 17:44:21 -07:00
  • 07d13541ed emergency fixes for corrupt tagdb tag id Matt Wells 2015-03-20 17:21:52 -07:00
  • 62bbede498 try to fix strange unknown tagid core Matt Wells 2015-03-20 14:20:40 -07:00
  • 7343c40e98 fix a couple status doc fields. Matt Wells 2015-03-20 12:51:36 -06:00
  • 1a80cb1b5d Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing mwells 2015-03-20 12:45:24 -06:00
  • 7de9f6940b documentation for new gbss spider status doc fields. mwells 2015-03-20 12:43:21 -06:00
  • 70bdc97bfd update README.md Matt 2015-03-19 23:31:09 -06:00
  • afa14e84a1 fixes for generating spider status docs. and displaying them. mwells 2015-03-19 23:24:36 -06:00
  • fc0b3e7743 url filters fixes. Matt 2015-03-19 18:17:26 -06:00
  • ed5fe6d284 more bug fixes from 'delete' column addition to url filters. Matt 2015-03-19 18:05:36 -06:00
  • 4b1dcfa068 fix isrssext bug Matt 2015-03-19 17:28:19 -06:00
  • 90456222b6 now we add the spider status docs as json documents. so you can facet/sortby the various fields, etc. Matt 2015-03-19 16:17:36 -06:00
  • d0be9f68a7 fix iswww directive mwells 2015-03-18 09:24:50 -06:00
  • 6a1875b619 inline doc update mwells 2015-03-17 21:50:10 -06:00
  • d5560c3e77 final fix for new delete column mwells 2015-03-17 21:47:31 -06:00
  • c9b14b1b89 fix 'delete' checkbox in url filters. fix reading in of xml conf files that have </> tags. Matt 2015-03-17 21:20:27 -06:00
  • dfc069aaa1 do away with filtered/banned spider priorities. add checkbox to signify force deletes to remove urls from index if in the index, or not allow them in. mwells 2015-03-17 20:27:23 -06:00
  • dea534827e langidbits init bug leftover from searchinput reset memset fix i think. Matt 2015-03-17 15:04:31 -06:00
  • 5b9aa0b0a5 added isroot url filter. mwells 2015-03-17 14:52:04 -06:00
  • ebaaaeeef3 Merge branch 'testing' into diffbot-testing mwells 2015-03-17 14:41:33 -06:00
  • 384761d4b5 fix build mwells 2015-03-17 14:40:43 -06:00
  • f830eb43f7 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2015-03-17 14:31:33 -06:00
  • a54471849b sitemap.xml support for harvesting loc urls. parse xml docs as pure xml again but set nodeid to TAG_LINK etc. so Linkdb.cpp can get links again. added isparentsitemap url filter to prioritize urls from sitemaps. added isrssext to url filters to prioritize new possible rss feed urls. added numinlinks to url filters to prioritize popular urls for spidering. use those filters in default web filter set. fix filters that delete urls from the index using the 'DELETE' priority. they weren't getting deleted. Matt 2015-03-17 14:26:16 -06:00
  • 29b2707ad7 fix searchinput::clear() bug. final fix for fhtqt memleak bug. Matt Wells 2015-03-15 07:48:29 -07:00
  • 3b39b1d37a fix facet mem leak from QueryTerm::m_facetHashTable and safebuf when doing federated queries over a token. Matt Wells 2015-03-15 07:18:32 -07:00
  • 427fae7135 fix log spam Matt Wells 2015-03-12 22:31:40 -07:00
  • e71af6d26c recompute active list every 3 secs. otherwise it seems buggy and drops collections it shouldn't. Matt Wells 2015-03-12 22:22:19 -07:00
  • 83be5d7d46 fix links parser so it harvests outlinks from rss feeds' <link> tags. it was doing this before, now it is doing it again. Matt 2015-03-12 17:35:47 -07:00
  • 48435c55b1 Revert "fix json search results formatting." Matt Wells 2015-03-12 15:42:20 -07:00
  • 7879537ab6 fix json search results formatting. Matt Wells 2015-03-12 14:25:37 -07:00
  • 5d2a9d6d8c Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2015-03-12 14:00:03 -07:00
  • 89b61d95a8 fix fix Matt Wells 2015-03-12 13:59:42 -07:00
  • 3c2b082540 gbfacetstr: is case-sensitive. Matt Wells 2015-03-12 13:54:11 -07:00
  • a8dfa56098 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2015-03-12 13:19:16 -07:00
  • 485c600c7c fix for changing maxtocrawl/process/rounds. fix for thinking a crawl is done when it is just taking a while to populate doledb from the waiting tree for that SpiderColl. we just call populateDoledbFromWaiting in doneSleepingWrapper every 50ms. it loops over every coll so it could be more efficient. Matt Wells 2015-03-12 13:15:52 -07:00
  • a4e95899bd makefile updates for building pkgs Matt 2015-03-10 21:43:37 -07:00